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SLICED INVERSE REGRESSION WITH VARIABLE 
SELECTION AND INTERACTION DETECTION 

By Bo Jiang and Jun S. Liu 
Harvard University 

Variable selection methods play important roles in modeling high 

dimensional data and are keys to data-driven scientific discoveries. In 

?H ' this paper, we consider the problem of variable selection with interac- 

^~H, tion detection under the sliced inverse index modeling framework, in 

which the response is influenced by predictors through an unknown 
function of both linear combinations of predictors and interactions 
among them. Instead of building a predictive model of the response 
given combinations of predictors, we start by modeling the condi- 
tional distribution of predictors given responses. This inverse model- 
ing perspective motivates us to propose a stepwise procedure based on 
likelihood-ratio tests that is effective and computationally efficient in 
detecting interaction with little assumptions on its parametric form. 
The proposed procedure is able to detect pairwise interactions among 
p predictors with a computational time of 0(p) instead of 0(p 2 ) un- 
der moderate conditions. Consistency of the procedure in variable 
selection under a diverging number of predictors and sample size is 
established. Its excellent empirical performance in comparison with 
^ . some existing methods is demonstrated through simulation studies 

>0 ' as well as real data examples. 
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1. Introduction. Recently there has been a significant surge of inter- 
est in analytically accurate, numerically robust, and algorithmically efn- 

fT} ■ cient variable selection methods, largely due to the tremendous advance 

in data collection techniques such as those in genetics, internet, and mar- 
keting. The importance of discovering truly influential factors from a large 

k>( ■ pool of possibilities is now widely recognized by both general scientists and 

$_i . quantitative modelers. Under linear regression models, various regularization 

methods have been proposed for simultaneously estimating regression coef- 
ficients and selecting predictors. Many promising algorithms, such as Lasso 
(Tibshirani, 1996; Zou, 2006; Friedman et al, 2007), LARS (Efron et al., 
2004) and smoothly clipped absolute deviation (SCAD; Fan and Li (2001)), 
have been invented. When the number of the predictors is extremely large, 
Fan and Lv (2008) have proposed a sure independence screening (SIS) frame- 
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2 JIANG AND LIU 

work that first independently selects variables based on their correlations 
with the response and then applies variable selection methods in the second 
step. 

1.1. SIR with Variable Selection. When the relationship between the re- 
sponse Y and predictors X = (X\,X2, ■ ■ ■ ,X p ) T is beyond linear, the per- 
formance of the variable selection methods based on the linear model as- 
sumption can be severely compromised. In his seminal paper on dimension 
reduction, Li (1991) proposed a semi-parametric index model of the form 

(1.1) y = /(/3fX,/3fX,...,/3jX,e), 

where / is an unknown link function and e is a stochastic error independent 
of X. A sliced inverse regression (SIR) method was developed by Li (1991) 
to estimate the so-called sufficient dimension reduction (SDR) directions 
fli, . . . ,(3 q . For the ease of description, we standardize the predictors, Z = 

[Cov (X)]"2 [x - E (X)], and rewrite model (1.1) as 

(1.2) Y = f(rfiZ,rgZ,...,t%Z,e) with Vi = [Cov (X)]3ft. 

The SIR algorithm was motivated by the following observation: under model 
(1.2), if E (b T Z\r]J Z, . . . , rj^Z) is linear in rjfZ, . . . , ry^Z for any vector b G 
MP, then E (Z| Y) is contained in the linear subspace spanned by rj 1 , ... ,rj 
(Li, 1991). So if the first q largest eigenvalues of Cov (E (Z|Y)) are all posi- 
tive, SDR directions can be obtained by the corresponding eigenvectors. 

Eigenvalues of Cov(E(Z|Y)) also connects SIR with multiple linear re- 
gression (MLR). In MLR, the correlation squared, R 2 , can be defined as 

R 2 = max [Corr (Y. b T Z)l 2 , 
beiRp L v ' n 

while in SIR, the largest eigenvalue of Cov (E (Z| Y)), called the first profile- 
R 2 , can be defined as 

Ai (Cov (E(Z|Y))) = max max [Corr (T(Y), b T Z)] 2 , 

where maximum is taken over all bounded transformations T(-) and vectors 
beff (Chen and Li, 1998). We can further define the A;th profile- R 2 , \k 
(2 < k < q), as the kth largest eigenvalue of Cov(E(Z|Y)) by restricting 
the vector b to be orthogonal to eigenvectors of the first (k — 1) profile- R 2 . 
Since the estimation of SDR directions does not automatically lead to 
variable selection, various methods have been developed to perform dimen- 
sion reduction and variable selection simultaneously in the nonlinear setting. 
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SIR WITH INTERACTION DETECTION 3 

For example, Li, Cook and Nachtsheim (2005) used a backward subset se- 
lection method based on x 2 -tests derived in Cook (2004), and Li (2007) 
developed sparse SIR (SSIR) algorithm to obtain shrinkage estimates of the 
SDR directions under L\ norm. Motived by the F-test in stepwise regression 
and the connection between SIR and MLR, Zhong et al. (2012) proposed a 
stepwise variable selection procedure called correlation pursuit (COP) for 
index models under the SIR framework. 

1.2. Interaction Detection via Inverse Modeling. The number of vari- 
ables to be considered can be even larger with the inclusion of interaction 
terms between predictors. Consider the following simple example with a 
regression model for a univariate response variable Y and p independent 
normally distributed predictor variables X = (X\,X2, ■ ■ ■ ,X p ) T : 

(1.3) Y = X t X 2 + e, 

where X ~ MVN p (0,I p ) and e ~ JV(0, 0.1). Since there are (%) pairwise 
interactions, fitting regression models with 2- way interactions using variable 
selection methods, or even the sure independence screening procedure, is 
challenging when one has a moderate number of predictor variables, say p = 
1000. Recently, there has been considerable effort in fitting interaction mod- 
els in the statistical literatures. For example, Bien, Taylor and Tibshirani 
(2012) developed hierNet, an extension of Lasso to consider interactions 
in a model if one or both variables are marginally important (referred to 
as hierarchical interactions by the authors). Li, Zhong and Zhu (2012) pro- 
posed a sure independence screening procedure based on distance correlation 
(DC-SIS) that is shown to be capable of detecting important variables when 
interactions are presented. 

Most of the aforementioned methods are derived from a forward mod- 
eling perspective, that is, a model for the conditional distribution of Y 
given X. When predictor variables X can be treated as random, we ob- 
tain a different modeling perspective by "flipping" the roles of X and Y 
and putting the response Y behind the (conditioning) bar, which we call 
an inverse model. Indeed, this inverse modeling perspective has been taken 
by several researchers and has led to new developments in dimension reduc- 
tion and variable selection methods. Cook (2007) proposed inverse regres- 
sion models for dimension reduction, which have deep connections with the 
SIR method. Simon and Tibshirani (2012) proposed a permutation-based 
method for testing interactions by exploring the connection between the for- 
ward logistic model and the inverse normal mixture model when the response 
Y is binary. Another classical method derived from the inverse modeling per- 
spective is the Naive Bayes classifier for classifications with high dimensional 
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Fig 1. Left panel: contour plot for the joint distribution of Y and Xi in example (1.3). 
Right panel: conditional means and variances of Xi given slices ofY. Slices are indicated 
by different colors and round dots marks conditional means of X\ across slices. The con- 
ditional variances of Xi within different slices (from top to bottom) are: 2.29, 0.92, 0.41, 
0.98 and 2.33 



features. Although Naive Bayes classifier is limited by its strong indepen- 
dence assumption, it can be generalized by modeling the joint distribution 
of features. Murphy, Dean and Raftery (2010) proposed a variable selection 
method using Bayesian information criterion (BIC) for model-based discrim- 
inant analysis. Zhang and Liu (2007) proposed a Bayesian method called 
BEAM to detect epistatic interactions in genome- wide case-control studies, 
where Y is binary and X are discrete. 

The inverse modeling perspective can also shed lights on interaction de- 
tections. Figure 1 shows the contour plot for the joint distribution of Y and 
X\ from example (1.3). If we divide the response into five slices and calculate 
the mean and variance of X\ within each slice (as shown in Figure 1), we can 
see that although X\ is marginally uncorrelated with Y and the conditional 
means of X\ are the same, the conditional variances of X\ across slices are 
very different. As an illustrative example, we generated 200 observations 
from example (1.3) and divided the range of response into 5 slices with 40 
observations in each slice. First, we calcualted the following test static: 



H 



nD* = n log a] ~ S ^n h log 



h=\ 



a 



(h) 



,j = 1,2,..., p, 
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where p = 1000, H = 5, n = 200 and n h = 40, 



*?' 



is the estimated 



variance of Xj in the /ith slice, and a? is the estimated variance of Xj using 

all the observations. We found that ( nD\ I = 62.48 and ( nD^ ) = 56.03 
are significant compared with the null distribution of irrelevant predictors, 
which is asymptotically x 2 (8) (we have max 3e / 3 4 1000} \ n ^*j) = 28.46 in 
this particular example) . Even if we had not been able to select X2 in the 
first round using marginal information, we can recover it via a conditional 
test given X\ as follows: 



nD* j]{1} = nlogaJ| {1} - ^n^log 



" r n, 12 

j = 2,3,..., p, 



h=\ 

2 



d (h) 



where <?-|7il ^ s ^ e es ti ma ted variance by regressing Xj on X\ in the foth 

slice, and 5^, is the estimated variance by regressing Xj on X\ using all the 
observations. We detected X2 as an important predictor conditioning on X\ 
because f nD^un ) = 148.83 is highly significant compared with the null dis- 
tribution, which is asymptotically x 2 ( 12) (here max je { 3j4] 100 o} (nD*,^-, 1 = 

31.85). Instead of screening all the (?) = 0(p 2 ) pairwise interaction terms, 
we discovered the importance of X\ and X2 by examining the conditional 
distributions of predictors given the sliced response, which only requires a 
computational complexity of 0(p). Motivated by this observation, we in- 
vestigate an inverse modeling approach to tackle the problem of interaction 
detection in this paper. We will show that both (nD*j and (nD*,,^ J cor- 
respond to likelihood-ratio tests based on specific inverse models, and also 
describe a more general procedure to recursively select relevant predictors. 
The rest of the paper is organized as follows. In Section 2, we intro- 
duce an inverse model for the conditional distribution of X given slices 
of Y. SIR method is presented as a maximum likelihood estimate of the 
model. A likelihood-ratio test statistic for selecting relevant predictors is 
derived in Section 2.1, which is shown to be asymptotically equivalent to 
the COP statistics of Zhong et al. (2012). We further augment the model to 
detect interactions between predictors in Section 2.2. A sure independence 
screening criterion based on the likelihood-ratio test statistic is proposed in 
Section 2.3. By cross-stitching independence screening and likelihood-ratio 
tests, an iterative stepwise procedure that we referred to as SIRI is devel- 
oped in Section 3. Various implementation issues including the choices of 
slicing schemes and thresholds are also discussed. Simulations and real data 
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6 JIANG AND LIU 

examples are reported in Section 4 and 5. Additional remarks in Section 6 
conclude the paper. Proofs of the theorems are provided in Appendix A. 

2. Sliced Inverse Regression Model with Interaction Detection. 

Let Y G R be a univariate response variable and X = (X±,X2, ■ ■ ■ ,X p ) T £ 
W be a vector of p continuous predictor variables, {(x^, yi)}f =1 are indepen- 
dent observations of (X, Y). For discrete response, we can naturally group 
{z/iKLi i n to a finite number of classes. For continuous response, the range of 
{yi}™ =1 can be divided into H disjoint intervals, referred to as slices, which 
are denoted as S\, S2, ■ ■ ■ , Sh- We define a function S(Y) as the slice mem- 
bership of response Y, that is, S(Y) = h if Y E Sh- For a fixed slicing 
scheme, we denote n^ = \Sh\ = Shn where Ylh=i s h = 1- 

To take an inverse perspective on SIR, we start with a seemingly differ- 
ent model. We assume that the distribution of predictors given the sliced 
response is multivariate normal: 

(2.1) X\YeS h ~MVN(ji h ,E),l<h<H, 

where fi^ £ fi + V q belongs to a ^-dimensional affine space, V g is a q- 
dimensional subspace (q < p) and \i £ MP. Alternatively, we can write 
P'h = ^ + ^7^, where 7^ £ R 9 and r is a p by q matrix whose columns form a 
basis of the subspace V. Although this representation is only unique up to 
a orthogonal transformation on the bases T, the subspace V q is unique and 
identifiable. The following proposition proved by Szretter and Yohai (2009) 
links the inverse model (2.1) with SIR. 

Proposition 1. The maximum likelihood estimate (MLE) of the sub- 
space V q in model (2.1) coincides with the subspace spanned by SDR direc- 
tions estimated from the SIR algorithm. 

According to Proposition 1, we could have derived the SIR algorithm from 
an inverse model. Next, we provide a view of the COP method for selecting 
variables with marginal effects from the inverse model (2.1). This will lay the 
ground work for the augmented model to select variables with interaction 
effects in Section 2.2. 

2.1. Likelihood- Ratio Tests for Selecting Variables with Marginal Effects. 
For the purpose of variable selection, we partition predictors into two sub- 
sets: a set of relevant predictors indexed by A and a set of redundant pre- 
dictors indexed by A c , and assume the following model: 

(2.2) X A \YGS h ~ MVN(/i fe €/i + V«,E), 
X A c\*A,YeS h ~ MVN(a + /3 T X.4,£o). 
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SIR WITH INTERACTION DETECTION 7 

That is, we assume that the conditional distribution of relevant predictors 
follows the inverse model (2.1) of SIR and has a common covariance matrix 
in different slices. Given the current set of selected predictors indexed by C 
with dimension d and another predictor indexed by j ' ^ C, we propose the 
following hypotheses: 

H :A = C v.s. H x : A = C U {j}. 

Let Pmo (') an d Pmi (") denote the probability distributions by plugging in 
the MLE of model parameters under Hq and under H\, respectively. Then, 
the likelihood-ratio test statistic can be written as 



Li 



p Ml (xjy) = p Mi (x [cu{j - }]e |Xj,Xc,y) p Ml (Xj\Xc,y) 
i|c Pm (x|y) p Mo (x [cu{i}] c|Xj, x C; y) p A4o p^Xc, y) 
PMjXjIXc.y) 

p^j^-ixcy)' 

where the last equality follows from 

p Mx ( x [cu{j}] c l^i> x c,y) = Pm ( x [cu{j}] c l^i) X c,y) 

according to model (2.2). The scaled log-likelihood-ratio test statistic is given 
by 

(2-3) 5, |c = - log (L jlc ) = £ log [l + TT^TT J > 

where A^ and X k + are estimates of the fcth profile- i? 2 based on x^ and 

A d+1 — \ d p 
x Cu{j}> respectively Under the null hypothesis, k ~ d+1 k — > 0. Since log(l + 



t) ~ £ when r; is small, we have 

i,_i J- — At. 



p , „V^ X k ~ X k d o. 



fc=l x ' v fc 



Coincidentally, from an inverse model, we re-discovered the COP statistics 
proposed by Zhong et al. (2012), which are defined as 

COPf l = n Xk ~ d ^ k ,k = 1,2, ... , q, and COP^ 1 = ]T COPf l . 
1 ~ \ fe=i 
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For all the predictors indexed by j £ C c , we can also obtain the asymptotic 
joint distribution of inDj\ c \ under the null hypothesis: 



\fc = l / jeC c 



where z k = (z kj ) jeCc ~ MVN f 0, [Corr (X i ,X i |Xc)] ijeCC J and z fc 's are in- 
dependent. 

Furthermore, as n — > oo, 

Dj\C ^> £>i|C 

, / Var(M i ) - Cov (Mj, X c ) [Cov (Xc)]" 1 Cov (M,-, X c f N 
= lo g ^l + - 

where Mj = E (Xj|X c , S(Y)), Vj = Var (X,-|Xc, S(Y)) and S'(F) = /i when 
Y £ Sh (1 < h < H). Note that Vj does not depend on Y under the 
assumption in model (2.2). By Cauchy-Schwarz inequality and normality 
assumption, 

Dj\ c = iff E (Xj\Xc,Y £ S h )=E (Xj\X c ) ,l<h<H. 

That is, the test statistic Dj\c almost surely converges to zero if the condi- 
tional mean of Xj is independent of slice membership S(Y). Detailed proofs 
on properties of likelihood-ratio test statistic are delegated to Appendix A.l. 
Given thresholds v a > v^ and the current set of selected predictors indexed 
by C, we can select relevant variables by iterating the following steps until 
no new addition or deletion occurs: 

• Addition step: find j a such that Dj a \ c = maxj g c c ^j\C'i let C = C + {j a } 
if D jalc >u a . 

• Deletion step: find jd such that -D? d |c-{j d } = m i n jeC ^j\C-{j}i let C = 
C - {j d } if D jdlc _ {jd} < u d . 

Under model (2.2), for relevant predictors indexed by j £ A, we have 
(2.5) Xj\X A „ {j} , Y 6 5 k ~N [af ] + (3JX A ^ {j} ,a] lA _ {j} ) ,l<h<H. 

Let ctj(Y) = J2h=i a \ *(Y ^ Sh)- We introduce the following concept to 
study the marginal effect of relevant predictors. 
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Definition 1 (Marginally Detectable). We say a predictor indexed by 
j is marginally detectable if there exist constants n > and £ > such that 

(2.6) Var( aj {Y)) > £n~ re . 

Under the following conditions, the stepwise procedure proposed above is 
consistent for marginally detectable predictors by choosing the thresholds 
appropriately. 

Condition 1. There exist < r m j„ < r max < oo such that 

Tmin < A min (Cov(X|y G S h )) < \ max (Cov (X|Y G S h )) < T max , 

and that 

^max (Cov (X)) < T max , 

where X m i n (-) and A max (-) denote the smallest and largest eigenvalue of a 
positive definite matrix. 

Condition 2. p = 0(n p ) asn-s-oo with p > and 1p + 2k < 1, where 
k is the same constant as in (2.6). 

The following theorem is proved in Appendix A. 2. 

Theorem 1. // all the relevant predictors indexed by A are marginally 
detectable with constant k, then under model (2.2), Condition 1 and Condi- 
tion 2, there exists constant c > such that 



and 



as n —> oo . 



Pr ( min max D~\r > en K ) — > 1 

v C:C c n^0jec c ■" 



Pr [ max maxD,ic < — n K ] — > 1, 
v C:C c a4=0 jec c J ' 2 



Thus, if we choose the threshold v a = cn~ K and v^ = (c/2)n~ K with c and k, 
defined above, then the addition step will not stop selecting variables until 
all the relevant predictors have been included, and once all the relevant 
predictors have been included, all the redundant variables will be removed 
from the set of selected variables. 
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2.2. Interaction Detection Under the Augmented Model. Let us revisit 
the example (1.3) with the stepwise procedure based on the likelihood-ratio 
test statistic proposed in the previous section. As illustrated in Figure 1, 
we have E (X X \Y G S h ) = E (X 2 \Y G S h ) = for 1 < h < H. In the first 
addition step with C = 0, the stepwise procedure fails to capture either 
X\ or X2 since -Di|c=0 = D 2 \c=$ = 0- I n order to detect predictors with 
interactions, such as X\ and X2 in this example, we augment model (2.2) 
to a more general form, 

(2.7) *A\Y£S h ~ MVN(/z h ,S fc ), 

X^|X^,yG^ ~ MVN(a + /3 T X^,S ), 

which differs from model (2.2) in its allowing for slice-dependent means and 
covariance matrices for relevant predictors. 

Under model (2.7), a predictor indexed by j G A is conditionally irrelevant 
if the conditional distribution of Xj given Xj^_u\ and Y £ Sh does not 
depend on slice Sh- If there exists a conditionally irrelevant predictor indexed 
by j G A, then we can always redefine the index set of relevant predictors 
to be A — {j} in model (2.7). To guarantee identifiability, variables indexed 
by A in model (2.7) have to be minimally relevant, that is, A does not 
contain any conditionally irrelevant predictor. In Appendix A. 3, we proved 
the following proposition: 

Proposition 2. Minimally relevant predictors indexed by A in model 
(2.7) is unique under Condition 1. 

By following the same hypothesis testing framework for variable selection 
in the previous section, we can derive the scaled log-likelihood-ratio test 
statistic under the augmented model (2.7): 



H 



(2.8) 5 ;ic = iog^c-Ev log 



a j\c 



h=l 

2 



where C indexes currently selected predictors and j G C c , 



a j\c 



is the 



estimated variance by regressing Xj on Xc in slice Sh, and ai c is the es- 
timated variance by regressing Xj on Xc using all the observations. The 



augmented test statistic ( nD*, c j was used to select relevant predictors in 
the illustrative example of Section 1.2. 
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Under the assumption that A C C with \C\ = d, we can derive the exact 
and asymptotic distribution of ( nD*, c J : 

n* ili, Qo \ sr^n h f Q h /n h \ 

nD Jlc ~ nlog 1 + -2^V log U-g n/ 

V Eh=iQhJ h=1 n \J2h=iQh/nJ 

A x \{H-l){d + 2)), 

where Q ~ X 2 ((H-l)(d + l)) and Q A ~ X 2 (n h - (d+ 1)) (1 < ^ < A") are 
mutually independent according to Cochran's theorem. For all the predictors 
indexed by j G C c given predictors indexed by C, we can also obtain the 
asymptotic joint distribution of I nD*< c I under the assumption that A C C: 

/(ff-l)(d+l) H-l 

where Zj's and Zj's are mutually independent with 

z i = (^)jgcc ~ MVN (°> I Corr (^.^fcl x c)]j,^(.- / - 
and 

When the sample size n — > oo, 



I :,, ) , ,„ - MVN 0, [Corr 2 (Xj,X k \Xc) 



D*\c ^ D* {c 

( Var(M i ) - Cov (M,-, X c ) [Cov (Xg)]' 1 Cov (M,-, X C ) T \ 

og V E ^') J 

+ io g E(y J )-iEiog(y i ) 

where M,- = E (A^|X C , 5(F)), V,- = Var (Xj\Xc, S(Y)) and S(Y) = h when 
V € 5/i (1 < /i < H). According to Cauchy-Schwarz inequality and Jensen's 
inequality, 

D* lc = iff E (Xj\Xc, Y G S h ) = E (X,|Xc) and 
Var (Xj|Xc, Y G 5 h ) = Var (X,-|Xc) , 

for 1 < h < H. That is, the augmented test statistic D*, c almost surely 
converges to zero if both the conditional mean and the conditional variance 
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of Xj is independent of slice membership S(Y). Proofs on properties of the 
augmented likelihood-ratio test statistic are delegated to Appendix A. 4. 

A forward-addition backward-deletion algorithm similar to the stepwise 
procedure proposed in Section 2.1 can be used with the augmented likelihood- 
ratio test statistic D*, c . To investigate the detectability of the augmented 
likelihood-ratio test, we introduce the following concepts. 

Definition 2 (Conditionally Detectable). We say a collection of pre- 
dictors indexed by C2 is conditionally detectable given predictors indexed by 
C\ if C2 n C\ = 0, and for any set C satisfying C\ C C and C2 <jt C, there exist 
constants k > 0, £1,^2 > such that either 
(2.10) 

Var{Mj) - CW(M j; X c ) [Cov^c)]' 1 Cou(M i ,X c ) T 



max 

jeC c nCi 

or 



E(^) 



max [log (EVj) - Elog (Vj)} > £ 2 n~ K 
j£C c nCi 



> Zm- K , 



where Mj =E(X j \X e ,S(Y)) ) Vj = Var{X j \X c ,S(Y)). 

In other words, if the current selection C contains C%, then there always exist 
detectable predictors conditioning on currently selected variables until we 
include all the predictors indexed by €2- A relevant predictor Xj indexed 
by j ^ C2 is not conditionally detectable given C\ either because it is highly 
correlated with some other predictors, or its effect can only be detected 
when conditioning on predictors that have not been included in C\. Based 
on Definition 2, we define stepwise detectable recursively as following. 

Definition 3 (Stepwise Detectable). A collection of predictors indexed 
by 7o is said to be 0-level detectable if 'X.j- is conditionally detectable given 
an empty set, and a collection of predictors indexed by T m is said to be m- 
level detectable (m > 1) if X^ is conditionally detectable given predictors 
indexed by U™~[ T%. Finally, a predictor indexed by j is said to be stepwise 
detectable if j £ U^^. 

According to Lemma 1 in Appendix A. 2, given the same constant k, there 
exists £1 such that the set of marginally detectable predictors defined in 
Definition 1 is contained in 7o, the set of 0-level detectable predictors. As a 
result, the definition of stepwise detectable expand the concept of marginally 
detectable. In Appendix A. 5, we will show that by appropriately choosing 
thresholds, the stepwise procedure will keep adding predictors until all the 
stepwise detectable predictors have been included. 

imsart-aos ver. 2011/11/15 file: siri.tex date: April 16, 2013 



SIR WITH INTERACTION DETECTION 13 

Theorem 2. If all the relevant predictors indexed by A are stepwise 
detectable with constant k, then under model (2.2), Condition 1 and Condi- 
tion 2, there exists constant c* > such that as n — > oo, 



and 



Pr min maxD* lr > c*n K -)■ 1, 



Pr I max maxD*, r < — n K ) — > 1. 
VC:C c a4=0 jeC c J ' L 2 ) 

Thus, by appropriately choosing the thresholds, the stepwise procedure 
based on D*, c is consistent in identifying stepwise detectable predictors. 

2.3. A Sure Independence Screening Strategy. When dimensionality p is 
extremely large (e.g., exceeding n 2 ), the performance of the stepwise pro- 
cedure can be compromised. So we recommend adding an independence 
screening step to first reduce the dimensionality from ultra-high to moder- 
ately high. A natural choice of test statistic for the independence screening 
procedure is D*, c with C = 0, that is, 

H 2 



h=\ 



sf 



2 

where 



of 



is the estimated variance of Xj in slice Sh, and a 2 is the 
estimated variance of Xj using all the observations. If we rank predictors 
according to {-D|,l < j < p}, then a sure independence screening (SIS) 
procedure that takes the first 0(n) predictors has a high probability (almost 
surely) of including the independently detectable predictors defined below. 

Definition 4 (Independently Detectable). We say a predictor Xj is 
independently detectable if there exist constants k > and £i , £2 > such 
that either 

tom Var(E(X 3 \S(Y))) 

(2 - U) E(Var(X 3 \S(Y)))-^ n • 

or 

logM(Var(Xj\S(Y)))-Elog[Var(Xj\S(Y))] > &n~ K . 

Simply put, independently detectable predictors have either different means 
or different variances across slices. Therefore, in the example (1.3), both 
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X\ and X2 are independently detectable because Var(ATi|y G Sh) and 
Var(X2\Y € Sh) (1 < h < i?) are different across slices. 

In Theorem 3, we proved that the SIS procedure based on {£)* , 1 < j < p} 
almost surely includes the independently detectable predictors under the 
following condition with ultra-high dimensionality of predictors. 

Condition 3. log(p) = O^) as n ->■ 00 with < 7 + 2k < 1, where k 
is the same constant as in (2.11). Furthermore, the number of the relevant 
predictors \A\ < £o n?? with r\ + K < 1 and constant £0 > 0. 

We proved the following theorem in Appendix A. 6. 

Theorem 3. Under Condition 1 and Condition 3, if all the relevant 
predictors indexed by A are independently detectable, then there exist c > 
and C > such that 

Pr {mmD* > cn~ K ) -»■ 1 



jeA 3 



and 



Pr 



{j : D* > cnr K , l<j<p} 



<Cn K+ri ) ->1. 



According to Theorem 3, we can first use the SIS procedure to reduce the di- 
mensionality from p to a scale below sample size, say nj log(n), and then ap- 
ply the stepwise procedure proposed in the previous sections. Note that pre- 
dictors that are marginally or stepwise detectable according to Definition 1 
and Definition 3 are not necessarily independently detectable. Fan and Lv 
(2008) advocated an iterative procedure that alternate between a large-scale 
screening and a moderate-scale variable selection to enhance the perfor- 
mance, which will be discussed in the next section. 

3. Implementation Issues: Cross-Stitching and Cross- Validation. 

The simple model (2.2) and the augmented model (2.7) compensate each 
other in terms of the bias- variance trade-off. Given finite observations, model 
(2.2) is simpler and more powerful when the response is driven by some lin- 
ear combinations of covariates, while model (2.7) is useful in detecting more 
complex relationships such as heteroscedastic effects or interactions. Sim- 
ilarly, the SIS procedure introduced in the previous section can be used 
with a large number of predictors, but cannot pick up stepwise detectable 
predictors that have the same marginal distributions across slices. To find 
a balance between simplicity and detectability, we propose the following 
cross-stitching strategy: 
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• Step 0: start with the SIS procedure with currently selected predictors 
C = 0; 

• Step 1: select predictors by using the stepwise procedure with addi- 
tion and deletion steps based on Dj\c in (2.3) and add the selected 
predictors into C; 

• Step 2: select predictors by using the stepwise procedure with addi- 
tion and deletion steps based on D*, c in (2.8) and add the selected 
predictors into C; 

• Step 3: run the SIS procedure on the remaining predictors conditioning 
on the current selection C , and iterate Step 1-3 until no more predictors 
are selected. 

We name the proposed procedure Sliced Inverse Regression for Interaction 
Detection, or SIRI for short. An flowchart of the SIRI procedure is illustrated 
in Figure 2. 



start 



sure inde- 
pendence 
screening 



yes 



stepwise selection 
under simple 
model (2.2) 



stepwise selection 

under augmented 

model (2.7) 




Fig 2. Flowchart of SIRI 

In the addition step of the stepwise procedure, instead of selecting the 
variable from j £ C c with the maximum value of Dj\c ( or ^*j\c)i we m ay also 

sequentially add variables with Dj\c > v a (or D*, c > u*). Specifically, given 
thresholds u a > v<i and the current set of selected predictors indexed by C, 
we can modify each iteration of the original stepwise procedure as following: 

• Modified addition step: for each variable j € {1, . . . ,p}, let C = C + {j} 
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if j £C and Dj\ c > v a . 
• Deletion step: find jd such that Dj d \c-{jA = mrn 7'eC Dj\c-{j}'i ^ ^ = 
C - {j d } if D jdlc _ {jd} < u d . 

The stepwise procedure with the modified addition step may use fewer itera- 
tions to find all the relevant predictors and will not stop until all the relevant 
predictors have been included if we choose v a = cn~ K in Theorem 1. How- 
ever, in practice, the performance of the modified procedure depends on the 
ordering of the variables and is less stable than the original procedure. Since 
we are less concerned about the computational cost of SIRI, we implement 
the original addition step in the following study. 

There are some implementation issues that we have not discussed so far. 
First, we need to choose a slicing scheme. If we assume there is a true slicing 
scheme from which data are generated, we showed in Appendix A. 7 that the 
power of the stepwise procedure tends to increase with a larger number of 
slices, but there is no gain by further increasing the number of slices once 
the slicing is already more refined than the true slicing scheme. In practice, 
the true slicing scheme is usually unknown (except maybe in the case when 
the response is discrete). When a slicing scheme uses a larger number of 
slices, the number of observations in each slice decreases, which makes the 
estimation of parameters in the model less accurate and less stable. We 
observed from intensive simulation studies that, with a reasonable number 
of observations in each slice (say 40 or more), a larger number of slices is 
preferred. 

Second, we need to choose the number of effective directions q in model 
(2.2) and the thresholds in adding and deleting variables. Section 2.1 and 2.2 
characterize the asymptotic distributions and behaviors of stepwise proce- 
dures, and provide some theoretical guidelines for choosing the thresholds. 
However, these theoretical results are not directly usable because: (1) the 
asymptotic distributions that we derived in (2.4) and (2.9) are for a single 
addition or deletion step; (2) the consistency results are valid in asymptotic 
sense and the rate of increase in dimension relative to sample size is usu- 
ally unknown. In practice, we propose to use a K-iold cross-validation (CV) 
procedure for selecting thresholds and the number of effective directions q. 

We consider two performance measures for K-lold cross-validation: classi- 
fication error (CE) and mean absolute error (AE). Suppose there are n train- 
ing samples and m testing samples. The j'th observation (j = 1,2,... , m) 
in the testing set has response yj and slice membership S(yj) (the slicing 
scheme is fixed based on training samples). Let p- = Pryvf (S(yj) = /i|X = Xj) 
be the estimated probability that the observation j is from slice Sh, where 
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Ai denotes the maximum likelihood estimate of model parameters. The clas- 
sification error is defined as 






S(yj) / argmax (p) 
h v 



We denote the average response of training samples in slice Sh as 

y ~ n =1 ns( yi ) = h] 'h-M—U' 



The absolute error is defined as 

1 m 

m ^—* 

j=l 



h=l 



CE is a more relevant performance measure when the response is categorical 
or there is a non-smooth functional relationship (e.g., rational functions) 
between the response and predictors, and AE is a better measure when there 
is a monotonic and smooth functional relationship between the response and 
predictors. There are other measures that have compromise features between 
these two measures, such as median absolute deviation, which will not be 
explored here. We will use CE and AE as performance measures through 
out simulation studies and name the corresponding methods SIRI-AE and 
SIRI-CE, respectively. 

4. Simulation Studies. In order to facilitate fair comparisons with 
other existing methods that are motivated from the forward model perspec- 
tive, the examples presented here are all generated under forward models, 
which differs from the inverse model assumptions of SIRI. The setting of the 
simulation also demonstrates the robustness of SIRI when some of its model 
assumptions are violated, especially the normal distribution assumption on 
relevant predictor variables within each slice. 

We start with the comparison of independence screening methods in re- 
ducing the ultra- high dimensionality while retaining the relevant predictors. 
Then, we evaluate different variable selection methods under a variety of 
forward models including linear model, single- and multi-index models and 
models with different types of interactions. 

4.1. Independence Screening Performance. We first compare the vari- 
able screening performance of SIRI with iterative sure independence screen- 
ing (ISIS) based on correlation learning proposed by Fan and Lv (2008) and 
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sure independence screening based on distance correlation (DC-SIS) pro- 
posed by Li, Zhong and Zhu (2012). We evaluate the performance using the 
proportion that relevant predictors are placed among the top [n/log(n)] pre- 
dictors ranked by the corresponding method, with larger values indicating 
better performance in variable screening. 

In the simulation, the predictor variables X = (X\,X 2 , ■ ■ ■ ,X p ) T were 
generated from a p-variate normal distribution with mean and covariances 
Cov (Xi, Xj) = fy % ~i\ for 1 < i,j < p. We generate the response variable 
from the following three scenarios: 

Scenario 0.1 : Y = X 2 - pX 1 + 0.2Xi 00 + ere, 
Scenario 0.2 : Y = X X X 2 + ae 2|Xlool e, 

c • n o \r ^ 10 ° 
scenario 0.3 : Y = — — - + ere, 

Xi+X 2 

where sample size n = 200, a = 0.2 and e is iV(0, 1) and independent of 
X. For each scenario, we consider four different settings with dimension 
p = 2000 or 5000 and correlation p = 0.0 or 0.5. Scenario 0.1 is a linear 
model with three additive effects. The way X\ is introduced is to make it 
marginally uncorrelated with the response Y (note that when p = 0.0, X\ is 
not a relevant predictor). We added another variable X\$q that has negligible 
correlation with X\ and X 2 and a very small correlation with the response 
Y. Scenario 0.2 contains an interaction term X\X 2 and a heteroscedatic 
noise term determined by Xioo- Scenario 0.3 is an example of a rational 
model with interactions. 

Proportions that relevant predictors are placed among the top [n/log(n)] 
by different screening methods are shown in Table 1. Under Scenario 0.1 with 
linear models, we can see that ISIS and DC-SIS had better power than SIRI 
in detecting variables that are weakly correlated wth the response (Xioo in 
this example). When the correlation between predictors p = 0.5 (Setting 2 
and 4), iterative procedures, ISIS and SIRI, were more effective in detecting 
predictors that are marginally uncorrelated with the response (X\ in this 
example) compared with DC-SIS. Under Scenario 0.2, ISIS based on linear 
models failed to detect the interaction term and often misses the predictor 
in the heteroscedastic noise term. When there are moderate correlations 
between two predictors X\ and X 2 in the interaction term (Setting 2 and 4), 
DC-SIS picked up X\ and X 2 about half of the time. However, when the two 
predictors are uncorrelated (Setting 1 and 3), DC-SIS failed to detect them 
most of the time. SIRI outperformed DC-SIS in detecting interactions for 
both settings with p = 0.0 and p = 0.5. Under Scenario 0.3, when there is a 
rational relationship between the response and the relevant predictors, SIRI 
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Table 1 

The proportions that relevant predictors are placed among the top [n/ log(n)] by different 

screening methods under Scenarios 0.1-0.3 in Section 4-1- 



Method 


Scenario 0.1 


Scenario 0.2 


Scenario 


0.3 




Xl X'2 


X\()Q 


Xi 


x 2 


X W Q 


X 1 


x 2 


-Xioo 


Setting 1: 


p == 2000, p = 


= 0.0 














ISIS 


1.00 


1.00 


0.02 


0.01 


0.46 


0.00 


0.00 


0.09 


DC-SIS 


1.00 


0.55 


0.07 


0.09 


1.00 


0.00 


0.00 


0.60 


SIRI 


1.00 


0.30 


0.32 


0.25 


0.97 


1.00 


0.99 


1.00 


Setting 2: 


p == 2000, p = 


= 0.5 














ISIS 


1.00 1.00 


1.00 


0.04 


0.02 


0.54 


0.00 


0.00 


0.15 


DC-SIS 


0.02 1.00 


0.71 


0.55 


0.53 


1.00 


0.03 


0.00 


0.59 


SIRI 


1.00 1.00 


0.45 


0.92 


0.87 


0.92 


1.00 


1.00 


1.00 


Setting 3: 


p == 5000, p = 


= 0.0 














ISIS 


1.00 


1.00 


0.02 


0.00 


0.43 


0.00 


0.00 


0.06 


DC-SIS 


1.00 


0.39 


0.03 


0.05 


1.00 


0.00 


0.00 


0.44 


SIRI 


1.00 


0.14 


0.15 


0.16 


0.99 


0.99 


1.00 


1.00 


Setting 4: 


p == 5000, p = 


= 0.5 














ISIS 


1.00 1.00 


1.00 


0.03 


0.02 


0.60 


0.00 


0.00 


0.07 


DC-SIS 


0.05 1.00 


0.71 


0.41 


0.44 


1.00 


0.00 


0.02 


0.61 


SIRI 


1.00 1.00 


0.39 


0.88 


0.86 


0.94 


0.98 


1.00 


0.99 



significantly outperformed the other two methods in detecting the relevant 
predictors. Performances of different methods are only slightly affected as 
we increase the dimension from p = 2000 to p = 5000. 

4.2. Variable Selection Performance. We further study the variable se- 
lection accuracy of SIRI and other existing methods in identifying relevant 
predictors and excluding irrelevant predictors. In the following examples, for 
both SIRI and COP, we implemented a fixed slicing scheme with 5 slices of 
equal size (i.e., H = 5) and used a 10-fold CV procedure to determine the 
stepwise variable selection thresholds and the number of effective directions 
q in model (2.2) of Section 2.1. Specifically, the number of effective direc- 
tions q was chosen from {0, 1, 2, 3, 4}, where q = means that we skipped the 
variable selection step under simple model (2.2) in the iterative procedure 
described by Figure 2. The thresholds in addition and deletion steps were 
selected from the grid {(fj, a = X 2 ( a «)<7)> v i,d = X 2 ( a i ~ 0.05, q))} for simple 
model (2.2) and from the grid {(i/? a = n _^ d+2) X 2 K (H - l)(d + 2)), v* d = 
n _ H n ( d+ 2) X 2 (®i - 0.05, (H — l)(d + 2)))} for augmented model (2.7), where 
X 2 (a,d.f.) is the lOOath quantile of x 2 (d.f.) and d = \C\ is the number of 
previously selected predictors. For a given p, the dimension of predictors, we 
chose {ai} = {1 - p- 1 , 1 - 0.5p-\ 1 - O.lp- 1 , 1 - 0.05p-\ 1 - O.Olp- 1 }. 

imsart-aos ver. 2011/11/15 file: siri.tex date: April 16, 2013 



20 JIANG AND LIU 

The other variable selection methods to be compared with SIRI and COP 
include Lasso, ISIS-SCAD (SCAD with iterative sure independence screen- 
ing), and hierNet, which is a Lasso-like procedure to detect multiplicative 
interactions between predictors under hierarchical constraints. The R pack- 
ages glmnet, SIS, COP and hierNet are used to run Lasso, ISIS-SCAD, COP 
and hierNet, respectively. For Lasso and hierNet, we select the largest reg- 
ularization parameter with estimated CV error less than or equal to the 
minimum estimated CV error plus one standard deviation of the estimate. 
The tuning parameters SCAD are also selected by CV. 

For variable selections under index models, we generated the predictor 
variables X = (X\, A2, ■ ■ ■ ,X p ) T from a multi-variate normal distribution 
with mean and covariances Cov(Aj,Aj) = p' l ~i' for 1 < i,j < p , and 
simulated the response variable according to the following models: 

Scenario 1.1 : Y = j3 T X + ae, n = 200, a = 1.0, p = 0.5, 
/3 = (3,1.5,2,2,2,2,2,2,0,... ,0), 

c • 1 1 v z2j=l X 3 

Scenario 1.2 : 1 = -, h ere, 

0.5 + (1.5 + £* *i) 2 



Scenario 1.3 : Y 



■3- 

n = 200,er = 0.2,p = 0.0, 
at 



l-5 + E?=i*J 



n = 1000,o- = 0.2,/9 = 0.0, 

where n is the number of observations, p is the number of predictors and is 
set as 1000 here, and the noise e is independent of X and follows A r (0, 1). 
Scenario 1.1 is a linear model which involves 8 true predictors and 992 irrel- 
evant predictors. Scenario 1.2, a multi-index model with 4 true predictors, 
was studied in Li (1991) and Zhong et al. (2012), and there is a non-linear 
relationship between the response Y and two linear combinations of predic- 
tors Ai + X2 + A3 and X2 + A3 + A4. Scenario 1.3 is a single- index model 
with 8 true predictors and heteroscedastic noise. 

For each simulation setting, we randomly generated 100 data sets each 
with n observations and applied variable selection methods to each data set. 
Two quantities, the average number of irrelevant predictors falsely selected 
as true predictors (which is referred to as FP) and the average number of 
true predictors falsely excluded as irrelevant predictors (which is referred 
to as FN), were used to measure the variable selection performance of each 
method. For example, under Scenario 1.1, the FPs and FNs range from 
to 992 and from to 8, respectively, with smaller values indicating better 
accuracies in variable selection. The FP- and FN-values of different methods 
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together with their corresponding standard errors (in brackets) are reported 
in Table 2. 

Table 2 
False positive (FP) and false negative (FN) values of different variable selection methods 

under Scenario 1.1-1.3. 



Method 


Scenario 1.1 


Scenario 1.2 


Scenario 1.3 




FP (0, 992) 


FN (0, 8) 


FP (0, 996) 


FN (0,4) 


FP (0,992) 


FN (0, 8) 


Lasso 


0.59 (0.10) 


0.00 (0.00) 


0.08 (0.03) 


1.07 (0.03) 


0.00 (0.00) 


8.00 (0.00) 


ISIS-SCAD 


0.35 (0.07) 


0.00 (0.00) 


0.60 (0.08) 


1.02 (0.01) 


5.08 (0.65) 


7.97 (0.02) 


hierNet 


0.59 (0.10) 


0.00 (0.00) 


8.65 (0.36) 


0.93 (0.03) 


7.66 (0.48) 


7.94 (0.02) 


COP 


0.69 (0.12) 


0.06 (0.03) 


1.84 (0.16) 


0.98 (0.01) 


1.26 (0.13) 


3.32 (0.19) 


SIRI-AE 


0.01 (0.01) 


0.09 (0.04) 


0.13 (0.04) 


0.07 (0.03) 


0.43 (0.08) 


4.82 (0.27) 


SIRI-CE 


0.26 (0.05) 


0.08 (0.03) 


0.55 (0.08) 


0.09 (0.03) 


2.02 (0.17) 


0.51 (0.16) 



Under Scenario 1.1, variable selection methods derived from linear models 
(Lasso, SCAD and hierNet) were able to detect all the relevant predictors 
(FN=0) with few false positives. On the other hand, COP, SIRI-AE and 
SIRI-CE missed some (about 10%) relevant predictors while excluded most 
irrelevant ones (lower FP vaues). The relatively high accuracy of methods 
developed for linear models is expected under this scenario, because the 
observations were simulated from a linear relationship. Under Scenario 1.2, 
Lasso achieved the lowest false positives, but it almost always missed one 
of the relevant predictor, X4, because of its non-linear relationship with 
the response. The other methods developed under the linear model assump- 
tion suffered from the same issue. However, SIRI-AE and SIRI-CE was able 
to detect most of the four relevant predictors (FN=0.09 and 0.07) with a 
comparable number of false positives. Under the heteroscedastic model in 
Scenario 1.3, the methods based on linear models failed to detect relevant 
predictors most of the time. Among other methods, SIRI-AE achieved the 
lowest number of false positives (FP=0.43) but missed about half of the 
relevant predictors (FN=4.82), while SIRI-CE selected most of the relevant 
predictors (FN=0.51) with a reasonably low false positives (FP=2.02). The 
performance of COP was in-between SIRI-AE and SIRI-CE with FN=3.32 
and FP=1.26. A possible explanation for the better performance of SIRI-CE 
relative to SIRI-AE in this setting is because the generative model under 
Scenario 1.3 contains a singular point at X^=i -^-3 = — 1-5- Since the absolute 
error is less robust to outliers than the classification error, SIRI-AE is more 
sensitive to the inclusion of irrelevant predictors and more conservative in 
selecting predictors. 

Next, we consider forward models containing different types of interac- 
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tions. Predictor variables X\,X 2 , ■ ■ ■ ,X p were independent and identically 
distributed iV(0, 1) random variables, and the response was generated under 
the following models given the predictors: 



Scenario 2.1 
Scenario 2.2 
Scenario 2.3 
Scenario 2.4 
Scenario 2.5 



Y = X X X 2 + ae, n = 200, 

Y = X x + XxX 2 + X X X 3 + ae, n = 200, 

Y = XxX 2 + X X X 3 +ae, n = 200, 

Y = XxX 2 X 3 + ere, n = 200, 500 and 1000, 

Y = X\X 2 + ae, n = 200, 



V 

Scenario 2.6 : Y = — — + ae, n = 200, 

X 2 + A3 

where n is the number of observations, p is the number of predictors and 
is set as 1000 here, a = 0.2 and e is independent of X and follows N(0, 1). 
Scenario 2.1 and Scenario 2.3 contain predictors with pairwise multiplicative 
interactions and without main effects. The model under Scenario 2.2 has hi- 
erarchical interaction terms {Xx has main effect). The three-way interaction 
model in Scenario 2.4 was simulated under three settings with different sam- 
ple sizes: n = 200, n = 500 and n = 1000. Scenario 2.5 contains a quadratic 
interaction term and Scenario 2.6 has a rational relationship. 

Because methods such as Lasso, SCAD and COP are not specifically de- 
signed for detecting interactions and are clearly at a disadvantage, we did 
not directly compare them with SIRI and hierNet. For the purpose of com- 
parison, we created a benchmark method based on ISIS-SCAD by applying 
ISIS-SCAD to an expanded set of predictors that includes all the terms up 
to k-way multiplicative interactions. The corresponding method, which we 
referred to as ISIS-SCAD-fc, is an oracle benchmark under Scenario 2.1-2.4 
when responses were generated according to 2-way or 3-way multiplicative 
interactions. Since DC-SIS as a screening tool has the ability to detect indi- 
vidual predictors under the presence of interaction effects, we also augmented 
ISIS-SCAD with DC-SIS and denoted the method as DC-SIS-SCAD-A:. In 
DC-SIS-SCAD-/c, we first used DC-SIS to reduce the number of predictors 
from p to [n/log(n)]. Then, we expanded the selected predictors to include 
up to k-way multiplicative interactions among them and applied ISIS-SCAD. 
Because DC-SIS-SCAD-A: does not need to consider all the interaction terms 
among p predictors, it has a huge speed advantage over ISIS-SCAD-/c but it 
may fail to detect all the predictors if the DC-SIS step does not retain all the 
relevant predictors. The FP- and FN-values (and their standard errors) of 
different methods including ISIS-SCAD-A; and DC-SIS-SCAD-A; under vari- 
ous scenarios are shown in Table 3, Table 4 and Table 5, respectively. Note 
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that FP- and FN- values are calculated based on the number of predictors se- 
lected by a method, not based on the number of parameters used in building 
the model. For example, if X3, X4 and A3X4 all have non-zero coefficients 
from hierNet under Scenario 2.1, we count the number of false positives as 
2, not 3. 

Table 3 
False positive (FP) and false negative (FN) values of different variable selection methods 

under Scenario 2.1-2.3. 



Method 


Scenario 2.1 


Scenario 2.2 


Scenario 2.3 




FP (0, 998) 


FN (0,2) 


FP (0, 997) 


FN (0, 3) 


FP (0, 997) 


FN (0, 3) 


ISIS-SCAD-2 


0.00 (0.00) 


0.00 (0.00) 


0.00 (0.00) 


0.06 (0.04) 


0.00 (0.00) 


0.03 (0.03) 


DC-SIS-SCAD-2 


0.00 (0.00) 


0.00 (0.00) 


0.25 (0.09) 


0.11 (0.03) 


1.56 (0.19) 


1.81 (0.11) 


hierNet 


2.38 (0.33) 


0.00 (0.00) 


6.93 (0.56) 


0.14 (0.05) 


6.98 (0.57) 


0.12 (0.05) 


SIRI-AE 


0.01 (0.01) 


0.00 (0.00) 


0.02 (0.01) 


0.04 (0.02) 


0.10 (0.04) 


0.11 (0.05) 


SIRI-CE 


0.76 (0.13) 


0.00 (0.00) 


0.29 (0.06) 


0.10 (0.04) 


0.86 (0.12) 


0.11 (0.05) 



Under Scenarios 2.1-2.3 of Table 3, the oracle benchmark, ISIS-SCAD-2, 
correctly discovered most of the relevant predictors in the two-way inter- 
actions and did not pick up any irrelevant predictor. It is encouraging to 
see that the performance of the proposed method SIRI-AE was compara- 
ble with ISIS-SCAD-2 (in terms of both false positives and false negatives), 
although SIRI-AE did not assume the knowledge on the generative model. 
Moreover, since both ISIS-SCAD-2 and hierNet considered all the pairwise 
interactions between p predictor variables, they have computational com- 
plexity 0(np 2 ) with p = 1000 and need much more computational resources 
compared with SIRI. On average ISIS-SCAD-2 and hierNet are more than 
100 times slower than SIRI (see Table 6 for running time comparison of 
different methods). While we can dramatically increase the computational 
speed by using DC-SIS to screen variables before applying more refined vari- 
able selection methods, relevant predictors may be incorrectly filtered by the 
DC-SIS procedure as shown by DC-SIS-SCAD's higher false negatives under 
Scenario 2.3 of Table 3. 

Under Scenario 2.4 with three-way interactions, the computational cost 
prevents us from directly applying ISIS-SCAD-3 to consider all the three- 
way interaction terms. So we only compared the performance of ISIS-SCAD- 
3 after variable screening using DC-SIS, that is, DC-SIS-SCAD-3 in Table 4. 
DC-SIS-SCAD-3 performed best under different sample sizes as it assumed 
the form of the underlying generative model. Among other methods, the 
performance of SIRI-AE improved relatively to hierNet as sample size in- 
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Table 4 

False positive (FP) and false negative (FN) values of different variable selection methods 

under Scenario 2.4 with different sample sizes. 



Method 

DC-SIS-SCAD-3 
hierNet 
SIRI-AE 
SIRI-CE 



Scenario 2.4 (n = 200) Scenario 2.4 (n = 500) Scenario 2.4 (n = 1000) 



FP (0, 997) FN (0, 3) FP (0, 997) FN (0, 3) FP (0, 997) FN (0, 3) 



0.45 (0.12) 
7.22 (0.64) 
0.98 (0.12) 
1.98 (0.16) 



0.85 (0.12) 
2.41 (0.08) 
2.27 (0.06) 
2.27 (0.07) 



0.00 (0.00) 
7.73 (1.17) 
0.36 (0.09) 
1.96 (0.17) 



0.00 (0.00) 
2.38 (0.08) 
0.70 (0.07) 
0.46 (0.05) 



0.00 (0.00) 
4.25 (1.17) 
0.21 (0.06) 
2.03 (0.19) 



0.00 (0.00) 
2.62 (0.06) 
0.00 (0.00) 
0.00 (0.00) 



creased. When sample size n = 1000, SIRI-AE was able to select all the 
relevant predictors with very low false positives. 

Table 5 
False positive (FP) and false negative (FN) values of different variable selection methods 

Scenario 2.5 and 2.6. 



Method 

ISIS-SCAD-2 

DC-SIS-SCAD-2 

hierNet 

SIRI-AE 

SIRI-CE 



Scenario 2.5 



Scenario 2j 



FP (0, 998) FN (0, 2) FP (0, 997) FN (0, 3) 



0.04 (0.02) 
2.38 (0.18) 
2.42 (0.44) 
0.08 (0.03) 
0.88 (0.11) 



1.09 (0.04) 
0.51 (0.05) 
0.88 (0.05) 
0.00 (0.00) 
0.01 (0.01) 



0.00 (0.00) 
0.81 (0.16) 
5.71 (0.59) 
0.51 (0.11) 
0.56 (0.11) 



3.00 (0.00) 
2.96 (0.02) 
2.91 (0.03) 
0.00 (0.00) 
0.00 (0.00) 



Table 6 

Average running time (in seconds) of different variable selection methods under 

Scenarios 2.1-2.3, 2.5 and 2.6. 



Method 


Scenario 2.1 


Scenario 2.2 


Scenario 2.3 


Scenario 2.5 


Scenario 2.6 


ISIS-SCAD-2 


13890.86 


9406.27 


11581.55 


10232.31 


4220.24 


DC-SIS-SCAD-2 


29.24 


25.77 


31.90 


37.03 


25.68 


hierNet 


15213.01 


26171.28 


34733.13 


37312.59 


27255.16 


SIRI 


27.96 


44.85 


20.01 


44.36 


35.26 



Simulations in Scenarios 2.1-2.4 were generated under the same model 
assumption as ISIS-SCAD-fc and DC-SIS-SCAD-fc, which gives them ad- 
vantage in the comparisons. Under Scenarios 2.5 and 2.6 of Table 5, when 
the generative model goes beyond multiplicative interactions, we can see 
that SIRI-AE and SIRI-CE significantly outperformed other methods in de- 
tecting relevant predictors with low false positives. 
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5. Real Data Examples. We applied SIRI to two real data examples. 
The first example studies the problem of leukemia subtype classification 
with ultra-high dimensional features. In the second example, we treat gene 
expression level in embryonic stem cells as a continuous response variable, 
and are interested in selecting regulatory factors that interact with DNA 
and other factors to determine expression patterns of genes. 

5.1. Leukemia Classification. For the first example, we applied SIRI-CE 
to select features for the classification of a leukemia data set from high den- 
sity Affymetrix oligonucelotide arrays (Golub et al., 1999) that have been 
previously analyzed by Tibshirani et al. (2002) using a nearest shrunken 
centroid method and by Fan and Lv (2008) using a SIS-SCAD based linear 
discrimination method (SIS-SCAD-LD). The data set consists of 7129 genes 
and 72 samples from two classes: ALL (acute lymphocytic leukemia) with 
47 samples and AML (acute mylogenous leukemia) with 25 samples. The 
data set was divided into a training set of 38 samples (27 in class ALL nad 
11 in class AML) and a test set of 34 samples (20 in class ALL and 14 in 
class AML). 

Table 7 
Leukemia classification results 

Method Training error Test error Number of genes 
SIRI-CE 0/38 1734 8 
SIS-SCAD-LD 0/38 1/34 16 
Nearest Shrunken Centroid 1/38 2/34 21 

The classification results of SIRI-CE, SIS-SCAD-LD and nearest shrunken 
centroids method are shown in Table 7. The results of SIS-SCAD-LD and the 
nearest shrunken centroids method were extracted from Fan and Lv (2008) 
and Tibshirani et al. (2002), respectively. SIRI-CE and SIS-SCAD-LD both 
made no training error and one test error, whereas the nearest shrunken cen- 
troids method made one training error and two test errors. Comparing with 
SIS-SCAD-LD, SIRI used a smaller number of genes (8 genes) to achieve 
the same classification accuracy. 

5.2. Identifying Regulating Factors in Embryonic Stem Cells. The mouse 
embryonic stem cells (ESCs) data set has previously been analyzed by 
Zhong et al. (2012) to identify important transcription factors (TFs) for 
regulating the expression of genes. The response variable, expression levels 
of 12408 genes, was quantified using RNA-seq technology in mouse ESCs 
(Cloonan et al., 2008). To understand the ESC development, it is impor- 
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tant to identify key regulating TFs, whose binding profiles on promoter 
regions are associated with corresponding gene expression levels. To extract 
features that are associated with potential gene regulating TFs, Chen et al. 

(2008) performed ChlP-seq experiments on 12 TFs that are known to play 
different roles in ES-cell biology as components of the important signaling 
pathways, self-renewal regulators, and key reprogramming factors. For each 
pair of gene and one of these 12 TFs, a score named transcription factor as- 
sociation strength (TFAS) that was proposed by Ouyang, Zhou and Wong 

(2009) was calculated. In addition, Zhong et al. (2012) supplemented the 
data set with motif matching scores of 300 putative mouse TFs compiled 
from the TRANSFAC database. The TF motif matching scores were cal- 
culated based on the occurrences of TF binding motifs on gene promoter 
regions (Zhong et al., 2005). The data consists of a 12408 x 312 matrix with 
(z, j)th entry representing the score of the jth. TF on the ith gene's promoter 
region. 

Table 8 
The ranks of 12 known ES-cell TFs (among 312 predictors) using SIRTAE and COP 



TF names 


Ranks 






SIRI-AE 


COP 


E2fl 


1 


1 


Zfx 


3 


3 


Mycn 


4 


10 


Klf4 


5 


19 


Myc 


6 


- 


Esrrb 


8 


- 


Oct4 


9 


11 


Tcfcp211 


10 


36 


Nanog 


14 


- 


Stat3 


17 


20 


Sox2 


18 


- 


Smadl 


32 


13 



In Zhong et al. (2012), the method COP selected a total of 42 predictors, 
which include 8 out of 12 TFASs and 34 out of 300 TF motif scores. Here, 
we used SIRI-AE to re-analyze the mouse ESCs data set and selected 34 
predictors, which include all the 12 TFASs and 22 TF motif matching scores. 
The relative ranks of 12 TFASs from SIRI-AE and COP are shown in Table 8. 
Among the top-10 TFs ranked by SIRI-AE, 8 of them are known ES-cell TFs. 
SIRI-AE is also able to identify Nanog and Sox that are generally believed 
to be the master ESC regulators but were missed in the results of COP. A 
further study of the top-ranked TFs whose roles in ES cells have not been 
studied, such as AP2 (ranked 11 by SIRI), PAX (ranked 19 by SIR!) and 
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SP1 (ranked 22 by SIRI), could help us better understand transcriptional 
regulatory networks in embryonic stem cells. 

6. Concluding Remarks. We study the problem of variable selection 
in high dimensions from an inverse modeling perspective. The contributions 
of the proposed procedure that we named SIRI is twofold. First, it is ef- 
fective and computationally efficient in detecting interactions. Combined 
with independence screening, SIRI can be used to tackle the problem of 
ultra-high dimensionality when the number of predictors in the model is 
much larger than the number of observations. Second, SIRI does not im- 
pose any specific assumption on the relationship between the response and 
the predictors, and is a powerful tool for variable selections beyond lin- 
ear models and detecting predictors with unknown form of interaction ef- 
fects. As a trade-off, SIRI imposes a few assumptions on the distribution of 
the predictors. As demonstrated in our simulation studies, SIRI has com- 
petitive performance when the generative model is different from the in- 
verse model assumption. However, we found that SIRI is not very robust 
against extreme outliers on values of the predictors. Data preprocessing, 
such as quantile normalization, is advised when extreme outliers are pre- 
sented from exploratory analysis. We have implemented the SIRI procedure 
using programming language R, and the source code can be downloaded 
from http://www.people.fas.harvard.edu/~junliu/SIRI/ or requested 
from the authors directly. 

Like other stepwise procedures such as linear stepwise regression, SIRI 
may encounter issues that are typical to stepwise variable selection methods 
as discussed in Miller (1984). Imagine a simple classification example with 
two relevant predictors having the same mean and variance, but different 
correlations in two classes. Then, the predictors are undetectable to any 
stepwise procedure that selects one variable at a time. In less extreme sce- 
narios, when relevant predictors have weak marginal effects but strong joint 
effects, iterative sampling procedures such as Gibbs sampling could be more 
powerful than stepwise procedures like SIRI. This motivates us to further 
study the problem of variable selection from a Bayesian perspective. 

Finally, inverse models are not substitutes but complements to forward 
models. When a specific form is derived from solid scientific arguments, a 
forward perspective that treats the distribution of predictors as a nuisance 
can be more powerful in building predictive models. Depending on one's 
research questions and objectives, it may be helpful to alternate between 
the two perspectives in analyzing and interpreting data. 
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APPENDIX A: DETAILED PROOFS 

A.l. Proof on Properties of Likelihood-Ratio Test Statistic in 
Section 2.1. Given the set of relevant predictors indexed by A with size 
\A\ in model (2.2), we let x_4 denote a n x |^4| matrix of observed variables in 
index set A, and x^ (1 < j < n^) denote a \A\ -dimensional column vector 
of variables in index set A for jth observation in slice Sh (1 < h < H). 
We denote B A = Cav(E(X A \S(Y))), W A = E(Cov(X A \S(Y))) and fl A = 
B A + W A . The corresponding sample estimates are given by 

H 



(A.l) B A = i fl n h fa - x^) fa A - x^) T , 






and ^ = B* + WU, where 5f A = \ ^Lx £?=1 ^ and x ^ = £ ££i x 5' ■ 
Szretter and Yohai (2009) proved the following proposition: 

Proposition 3. Let C be the orthogonal matrix [ci, . . . ,c p ], where Cj is 



tie 



an eigenvector of B, W A B, = C@C T corresponding to the eigenval 
9j, where 9\ > 62 > ■ ■ ■ > 9 P . O is the diagonal matrix with 0\, 62, ■ ■ ■ , p in 
the diagonal. C r is the matrix with the first r columns of C. 

(a) The maximum likelihood estimate ofT, = Y, A = Cov (X_4|S*(y)) in model 
(2.2) is 

(A.2) t A = W A + B A l2 C^ q C T v _ q B\ 12 . 

(b) LetUi, 1 < i < p, be orthogonal eigenvectors of norm one o/X^ B A T, . 
corresponding to the eigenvalues u\ > 0J2 > • • • > w p ■ The maximum likeli- 
hood estimate of Hh = /A = E (X^jY G Sh) in model (2.2) is 

P$) = tfu K U^ 1/2 fa A -5t A )+5L A ,l<h< H, 

where U q = [ui,U2, • • • , u q ]. Then, 'jT A is the orthogonal projection, using 
the norm associated to Y< A , of (x A — x^) on the q-dimensional affine sub- 
space x^ + V 9 , where Y q is spanned by f T> A Ui, S^ U2, . . . , S_^ u g 

(c) Ti A can also be written as 

-, H n h 

1 V-V-/ftj Mh)\(hj Mh)\ T 



h=X j=l 
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£_4 Uj is an eigenvector of Bj^Wj^ corresponding to eigenvalue l/6 p -i+i, 
1 < % < p. 

Properties of the likelihood-ratio test statistic proposed in Section 2.1 are 
summarized in the following proposition: 

Proposition 4. Given the current set of selected predictors indexed by 
C with dimension \C\ = d and another predictor indexed by j £ C, the scaled 
log-likelihood-ratio test statistic for testing 

H :A = C v.s. Hi : A = C U {j}, 

can be written as 

(A-3) D jie = - log (L jic ) = £ log [l + -fz^T) > 

where q is the dimension of sub space Y q defined in model (2.2), X k and A fc 
are the kth largest eigenvalues of Q^ Be and ^Krus ■\}B\c\j{j}\> respectively, 

and A^ and X k + are their estimates. For any fixed slicing scheme and the 
true relevant predictors indexed by A, we let A& be the kth largest eigenvalue 
of Q,^ Bj^. We further assume that Ai > A2 > • • • > \ > 0. 
(a) Under the assumption that A C C, 



(A.4) nDflc-Y,- 



1 n[\f +1 -\? 



i=i 



1-A 



d+l 



which is asymptotically equivalent to the correlation pursuit (COP) test 
statistic defined in Zhong et al. (2012) and has an asymptotic distribution 

ofx 2 (q)- 

(b) Under the same conditions as in (a), 

( n ^ic) ^(z~2 z kj) ' and s ( n ^i c ) ^fM\^2 4 » 

J \fc=l / jeC c J J \j=l / 

where z k = (z k j) jeCc ~ MVN[0, [Corr{X i ,XA~Xc)] i . &Cc \ and z k 's are in- 
dependent. 

(c) As n — ^ 00, 

Dj\C ^> D m 

lo L | VarjMj) - CovjMj^c) [CovjXc)}- 1 CovjM^Xcf) 
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where Mj =E(X j \X c ,S(Y)) > Vj = Var(Xj\Xc,S(Y)) and S(Y) = h when 
Y G Sh (i- < h < iJ). Vj- does not depend on Y under the assumption in 
model (2.2). Furthermore, 

D j{c = iffEiXjlXcY G S h ) =E(X i |X c ),l < fc < H. 

Proof of Proposition 4. One can show that 



2 i \ 
D j\c = -^[ L j\c) 



log 



det ' fi [CuO}] S [Cu{i}] 



log 



det \£t x %c 



To prove (A. 3), we just need to show that 



log 



det(Vs c )] = J>g(l-A?), 



i=l 



where d = |C| and Af is the ith largest eigenvalue of Q c Be corresponding 
to eigenvector rj i , 1 < i < d. Since Bcr\ i = Xf^lcVi an d ^c = Wc + Be, 



Then, 



and 



&cVi = yB c r}i 



WcB^SlcVi 



1-A? 



;Wc»fc 






Vera, 



B c ^w c B c ^) B^n CVi = l^t-^ncm. 

Thus, the eigenvalues of B c WqB c are given by 6d-i+i = - d ' , 1 < 

i 

i < d. Let Cj be an eigenvector of B c WqB c = CGC T corresponding 
to the eigenvalue #j. We will prove that the eigenvalues of S^7 i?c are 



LU; 



9 d-i+i l-xf 

i _ Td 



if 1< i < 



i+tf d - 



Af if g + 1 < i < d, 



1-V2, 



with corresponding eigenvectors given by b^ = B c c^-j+i. According to 
(A. 2) in Proposition 3, 



B. 



-1/2; 



-!/2, 



-1/2 



|V2 # 



<C R l/2 



( . ^Wc + -B c C|^|_ q C d _ gJ B c \B C C d _,; + i 



-V2, 



C ^C-^c c d-i+l = B. 

= (b c l, 'W c B c l i* + GVfcGXJ c d _ i+1 , 
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where B c W C B C c d _j + i = dd-i+i, C d - q Cj_ q c d - i+1 is for 1 < i < 
and 1 for q + 1 < i < d. Thus, 

a c 2-*C&c c d-i+l — — Cd-i+1, 



and 



Therefore, 



log 



det OT £c 



S c Bchi = iOihi, 1 < i < d. 



log [det fe^c) - log det fe 1 B C 



J>g(l-A 



i=l 



To prove (a) and (b), note that 



nD 



3\C 



-n 



vi=l 



£log 1-Af 1 -£log l-A 



»=i 



/? 



J>g 1 + 



A? +1 -A? 



i=i 



i - a^ 1 ; 

Given that A <Z C, Zhong et al. (2012) showed that 

n ( Xf +1 - Xf) 

-,i = 1,2,..., q, 



d+l 



l-A 

are asymptotically independent and identically distributed as x 2 (l)- Thus, 



A' 



d+l 



A? 



1 - Af +1 



0,i = 1,2,... 



and 



nD 



* / A? +1 -A? 



i n(\ 

E 



d+l 



,=i 1-Af 1 



Conditioning on variables in C, variables in C c follows a multivariate nor- 
mal distribution with the same mean and variance across slices. Then, the 
proofs of (a) and (b) directly follow from the Theorem 1 and Theorem 2 in 
Zhong et al. (2012). 
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For (c), since Xf — > Xf, for i = 1, 2, . . . , d, 



D 



3\C 



i=l 



£log 1-A^ 1 +J>g 1-A 



«=i 



log [det (n^ u{j}] W mJ}] )] + log [det (n^W c )] 



log 



log 



det (tt 



icu{j}]j 



det (O c ) 



log 



det (W\ 



icum, 



det (W c 



Var (Xj) - Cov (X,-, X c ) [Cov (Xc)}' 1 Cov (X,-, Xc) 5 



-log(V-; 
log I 1 + 



Var (Mj) - Cov (Mj, X c ) [Cov (Xc)] -1 Cov (M,-,X C 

V 



The last equality follows from Var (Xj) = Var (Mj)+E(Vj) = Var (Mj) + Vj 
and Cov^.Xc) = Cov(Mf,X c ), where Mj = E(Xj\X c ,S(Y)), Vj = 
Var (Xj|Xcj S(Y)) and Vj is a constant that does not depend on Xc or 
S(Y). Note that 

Var (Mj) > Cov (Mj, X c ) [Cov (Xc)p 1 Cov (M,-, X C ) T 

where the equality holds if and only if Mj = E (Xj |Xc, S'(y)) is a linear com- 
bination of Xc that does not depend on S(Y), that is, E (X,-|Xc, Y £ Sh) = 
E (Xj\~Kc) f° r 1 < h < H under the normality assumption. □ 

A. 2. Proof of Theorem 1 in Section 2.1. To prove Theorem 1, we 
will need the following two lemmas. 

Lemma 1. Under the same conditions as in Theorem 1, there exist k > 
and £i > such that for any set of predictors indexed by C and C c (lA ^ 0, 



max 
jeC c nA 



Var(Mj) - Cov(Mj,X c ) [Cov^Xg)]' 1 Cov(Mj,X c ) T 



> fin" 



where Mj = E (X,|Xc, S(Y)), Vj = Var(Xj\Xc, S(Y)), and Vj is a constant 
that does not depend on Xc or S(Y). 

Lemma 2. Under the same conditions as in Theorem 1, for any set of 
predictors indexed by C, we let Af be the ith largest eigenvalue of Q^ Be 
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and let X^ be the ith largest eigenvalue of Q,^ Bq. Then, for < e < 1 and 
i = 1, 2, . . . , q, there exist positive constants C\ and C<i such that 

(A.5) Pr( max log (l - Af ) - log (l - Af ) > e J 

/ r 4 ■ e 2 

< 2p(p + l)Ci exp -C 2 n- mm 



64t r> 



' maxP 

We will start with the proofs of two lemmas. 

Proof of Lemma 1. Let B = C n A, S = C n A c , and V = C c n A / 0. 
Under model (2.2), 

Xu|X B , Y e S h ~ iV (a^j + /35| B X B , S x = X!x>|£?J . 

Let a^, = (ad 6 x>) , where a!- is defined in (2.5). Then, a v = ^a^L, 
where * is a |2?| by |P| matrix that satisfies ^ T * = E^AfjEJf 1 , Ad = 
diag ( o" 2 i_4 r • i J and c^ri is defined in (2.5). Under Condition 1, 

°j\A-{j} - Tmax and Amax ( S r X ) < ^-, and A max (tf T tf) < (*%*) ■ Thus ' 

trace (Var (ax>(Y))) = trace (War (a v \ B {Y)) ^ T ) 

< ( -^ ) trace (Var (a v \ B (Y))) , 

where ax>(Y) = a v and axiisC^) = QW B when 7£S/,. Furthermore, 
trace (Var (a P | S (Y))) > |X>| (l^\ £ n -«. 

\7"max/ 

Assume 

Cov(X £ ; u x,|X B ) = ( ° 01 

Under model (2.2), conditioning on X_4 = Xgux>, the distribution of Xg 
does not depend on the slice. Therefore, the conditional distribution of Xp 
given Xc = Xgyg can be written as 

Xx>|X c , Y e S h ~ N \a£\ C + /35|c x C) s £>|c 
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where a$ c = a + MaJ, M = Ef (l {v{ - N T (l, £ | + NN T ) _1 n) S^, 

_ i _i 

N = S 2 SoiS 1 2 and ao is a constant that does not depend on the slice. 

Since A max (NN T ) < ^#4 < ^, A min (Si) > r min and A min (S^ 1 ) > 

v 7 A min^Oj 'mm \ j. / 

, we have 

A min (l P | - N T (I| £ , + NN T )^ N) > — \ 



1 'maj: ^- _1_ T- 

T _ . ' max n^ ' i 



max T^ ' mm 



trace (Var (a v \ c (Y))) = trace (MVar {a v \ B (Y)) M T ) 

> A mi „ (Sr 1 ) A m i„ (Si) A min (l|23| " N T (l| £ | + NN T )^ N 

x trace (Var (ax>|BQO)) 

> ( - T 7 r . ) 2 (^)(VarK| g (y))) 

\ ' max ~r ' mm / \ ' max / 

> \v\ (_^-) 2 (^) 3 ^„- 

\ ' max ~T ' mm / \ ' max / 

Thus, there exists j £ T> = C c C\A such that 

Var (a, |C (y» > (;^— f (^) V"- 

\ ' max ~T ' mm / \ ' max / 

For such j, we have Mj = (Xj\ C (Y) + /3jf c X c , V,- = E^ < r max , and 

Var (Mj) - Cov (Mj, X c ) [Cov (X c )] _1 Cov (M,-, X C ) T 
= Var (a Jlc (Y)) - Cov (a JIC (Y),E(X C \S(Y))) [Cov (Xc)p 1 
Cov{a 3lc (Y),E(Xc\S(Y))) T . 

Let Ti = Cov (E (X C \S(Y))) and T 2 = E (Cov (X C |S(Y))) = Cov (Xc\S(Y)) 
Since A min (T 2 ) > r min and A max (Ti) < A max (Cov (X c )) < r max , we have 

A min (tVT 2 T\ ^ > ^ and 



< 



Tmax 

Cov { aj \ C (Y),E (Xc\S(Y))) [Cov (Xc)]" 1 Cov (a jlc (Y),E (X C \S(Y))) T 

Cov (a j1c (F), E (Xc\S(Y))) (T, + T,)- 1 Cov (a,, c (V), E (X C |5(V))) T 

Cov(a J | C (y),E(X c |g(V)))T r 1 Cov(a J | C (y),E(X c |g(V))) T 

i _ i 

2T T 2 



1 + Amin I T 1 2 T 2 T 1 



< - "™ Var(a j|c (Y)) 

' mm ~r ' max 
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Therefore, 



Var (Mj) - Cov (M j} X c ) [Cov (X c )] _1 Cov (M j} X c ) 



V r ; 



> 



> 



1 



7ti 



^rnax ^rnin T" ^max 

1 



Var (a i|c (y)) 



3 / x 3 

^miri \ / 7"min 



''"max \ 'mm "T 7"min 



£«"", 



where £i 



max \ 'max i' mm 



* 3 / \ 3 

' i 1 1 1 i > 1 ( T min 1 C 



U 



Proof of Lemma 2. Since Xc follows either a multivariate normal dis- 
tribution or a finite mixture of multivariate normal distributions, under 
Condition 1, one can show that for % = 1,2, ... ,g and any e > there 
exist constant C\ and C2 such that 



Pr I max 

v Cc{l,2,...,p} 



A^-A 



> e < 2p(p + l)Ci exp -C 2 n 



r 2 - e 2 
16p 2 



following similar arguments in the proof of Lemma 2 in Zhong et al. (2012). 
Since n c = W c + B c , where fl c = Cov (X c ) and W c = E (Cov (X C |,S(Y))) = 
Cov(Xc|£(F)) under model (2.2), A max (^c) < 7" max and A min (Wc) > r min 
Thus, 



c r] 1 Ben , . v w c v . . 

X\ =- max -^ttz — = 1 — mm rn „ ^ I 



|tj||=1 T? T ^ C r? 



min| N | =1 r7 T W C T7 r min 



tj| |=i rftlcV max|| T? || =1 ?7 r r2cr7 r max ' 



and 



1 - AV > 1 - Ai > 



^5., fori = 1,2, 



Therefore, for i = 1, 2, . . . , q and < e < 1, there exist constant C\ and C2 
such that 



Pr [ max 

VCc{l,2,...,p} 



log (1 - x9 

A? -A? 



log 1 - A 



C\ 



> e 



< Pr 



max 



Cc{i,2,.., P } 1 - Af 



> \ 



< Pr I max 

\Cc{l,2,...,p} 



A, - — A 7 - 



> 



2r 



< 2p(p + l)Ci exp -C 2 n 



r 4 - e 2 

mm 
" maxr 



as n — )• 00. 



□ 
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Proof of Theorem 1. Let Re = £f=i log (l - A?)-£Li log (l - Af). 
Then, according to Lemma 2, for < e < 1, there exist constant C\ and Ci 
such that 

Pr ( max \R C \ > e) < 2p(p + l)qC x exp f-C , 2 w g /? in % = ) . 
\cc{i,2,..., P } J V 64r2 axP 2g2y 

Under Condition 2,p = o (n p ) with 2^+2^ < 1, and for any positive constant 
C, 

Pr ( max l-Rcl > Cn _K 
\Cc{l,2,...,p} 

< 2p(p + l)gd exp (-C 2 n 1 - 2K - 2 P ^ C 2 

as n — )• oo. For j ^ C and d = |C|, 



% = -E^-^J+^M 1 "^ 

= - E log (l - Af +1 ) + E log (l - A? ) - R [Cu{j}] + ifc 



i=l i=l 



lo A | Var (M,-) - Cov (M,-, X c ) [Cov (Xg)p 1 Cov (M,, X c 



Vj 



- R [cu{j}} +Rc, 



where Mj = E (X,-|Xc, S"(V)), Vj = Var (X,-|Xc, S(y)), and Vj is a constant 
that does not depend on Xc or S(Y) under model (2.2). 

When C c n A ^ 0, according to Lemma 1, there exist n > and £i > 
such that 



max 
jeC c nA 



Var (Mj) - Cov (M^Xg) [Cov (Xc)]" 1 Cov (M i; X c ) 

V5 



> Cm"", 



Then, for sufficiently large n, there exists j £ C c (~)A such that 



lo L + Vax(Mj) - Cov(Mj,Xc) [Cov(Xc)]- 1 Cov (M i; X c ) T \ > |i n _„ 



V) 



and 

B i|C >|n-«-(|i? [Cu{i}] | + |i?c|). 



2 
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Pr ( max \R C \ > -n K 

\cc{i,2,...,p} 2 



Pr I min max Dj\r > en K ) — > 1, 
\C:C c nA^ jeC c a4 J ' ) 

as n — > oo. 

When variable C c n A = 0, for j G C c C -4 C , Afj = E (-X^Xc, 5(F)) 

E(Xj|Xc) is a linear combination of Xc under model (2.2), and 



Var (Mj) - Cov (Mj, X c ) [Cov (X C )P Cov (M,-, X c )^ 

Thus, 

D j\c < {\R[Cu{j}]\ + \Rc\) , 
and 

Pr | max max5,-i r > CvT K | < Pr ( max \Rr\ > —n~ K | ->■ 
VC:C=n^=0 jeC c Jl / V c c{i,2,..., P } 2 / 

for any positive constant C as n — > oo. D 

A. 3. Proof of Proposition 2 in Section 2.2. We start with a simple 
case to get some intuitions behind the proof. Suppose there are two different 
sets of minimally relevant predictors {X\} and {^2}- Under model (2.7) 
and Condition 1, the support for the joint distribution of two predictors 
X\ and X2 is the entire 2-dimensional space M?. If X\ _LL S(Y)\X 2 and 
X 2 AL S(Y)\X 1 , then we have Pt(S(Y)\Xi = x x ,X 2 = x 2 ) = Vi{S{Y)\X 2 = 
x 2 ) = Pr(S(Y)|Xi = xx) for any x u x 2 G R. Thus, both Pi(S(Y)\X 2 ) 
and Pr(S(Y)\Xi) are constants, and X\ _LL S(Y) and X 2 _LL S(Y), which is 
contradictory to the assumption that {X\} and {X 2 } are minimally relevant. 

Proof of Proposition 2. Suppose there are two different sets of mini- 
mally relevant predictors indexed by B and C, respectively. Define A\ = BnC, 
A 2 = BC\C C and A 3 = B c n C. Then, we must have A 2 + and ^3 / 0. The 
conditional distribution of X^ 2 given X_4 X and slice Sh can be written as 

X M \X Al ,Y G 5 h ~ AT (aj^ iB^X^.f = £^ 

Let a^ = a ( £ lAi and B^ = BJJ^ = fofA . Since B is minimally 
relevant, at least one of a^, B;- (j G A\) and T> 2 has to be different 
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across slices. The conditional distribution of X_4 3 given X_4 lUw 4 2 = Xg is the 
same across slices, 

X_4 3 |X B ,Y E S h ~ N(a + bX Al +cX^ 2 ,S 3 = S B ) . 

The conditional distribution of X_4 3 given X_4 X and slice Sh is 

X^ 3 |X^ ,YeS h ~N(a + ca^ + (b + cB^) X^ , S 3 + cS^ } c T ) , 



Since ^,3 G C and C is minimally relevant, at least one of ccr ', cYl- (j G 

Ai) and M' ' = cS;, c T has to be different across slices. Given X_4 lLJ _4 3 = 
Xc and slice Sh , the conditional distribution of cX_4 2 is 

cX.4 2 |X c ,y G 5 h ~ iV (V h) +d('')x /i + cWx^.eW 



where C< h > = MW (S 3 + MW) \ 7M = -C^o + (l|^ 3 | - C^) cqW, 

D (h) = _ c Wb+(I|_4 3 | - C^) cBW, ew = mW-mw (s 3 + mC 1 )) -1 mW. 

First, if M^ 1 ) (/i = 1,2,... ,H) are different across slices. Then NW = 

— - ' — 1 —i 

S 3 2 M^'S 3 2 = rWAC 1 ' IT' 1 ] are different across slices (note that under 

1 
Condition 1, S3 is invertible). We have 



1 „, ,., / „A -1 



(A.6) S 3 2 SWS 3 2 = nW-NW(I|^[+nW) N^ 



rWAWfl^+AW") 



-1 



■ph 



-1 



_JL _ 1 

Thus, S 3 2 E^ 'S3 2 and S^ > are different across slices. 

Second, if MA > = M (h = 1, 2, . . . , i?) are the same across slices. Then 
at least one of ccr ' and cB^- (J G »4i) has to be different across slices. 
Without loss of generality, assume ccr ' are different across slices, that is, 
trace (Var (ca(Y))) > 0, where a(Y) = a>( h > when Y G Sh- Under Condi- 
tion 1, A max (N) < 2ml, A min (S 3 ) > r min and A min (S3 T ) > ^-. Then, 

Tmin \ o / 'max 



1. 
2 

3 ' 



C (ft) = C = I M - M (S 3 + M)- 1 = S3 2 (l A3 - N (l4 3 + N)- 1 ) S " 
and 

trace (Var (7(F))) = trace (CVar {ca{Y)) C T ) 
> A min (S3- 1 ) A min (S 3 ) A min (Ia 3 ~ N (U 3 + N)- 1 ) trace (Var (ca(V))) 



> ( — (-= trace (Var (ca(r)))>0, 

, ''"max "T Train / \ ''"min / 
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where y(Y) = 7^ ' when Y G Sh- Thus, 7' ' are different across slices. 

Therefore, given Xc and slice Sh, the conditional distribution of cX_4 2 
depends on slice Sh, which is contradictory with the previous assumption 
that A2 & C c and the conditional distribution of X_4 2 given Xc is the same 
across slices. So we must have B = C. □ 

A. 4. Proof on Properties of Augmented Likelihood-Ratio Test 
Statistic in Section 2.2. Properties of the likelihood-ratio test statistic 
proposed in Section 2.2 are summarized in the following proposition: 

Proposition 5. Given the current set of selected predictors indexed by 
C with dimension \C\ = d and another predictor indexed by j ^ C, the scaled 
log-likelihood-ratio test statistic for testing 

H Q :A = C v.s. Hi : A = C U {j}, 

under the augmented model (2.7) can be written as 

H 



(A.7) ^|c = log^c-£v lo S 



where 



a j\c 



h=l 

2 



a j\C 



is the estimated variance by regressing Xj on Xc in slice Sh, 



and a , c is the estimated variance by regressing Xj on Xc using all the 
observations. 

(a) For any fixed slicing scheme and A C C, 

% ~ log (l + *-) -£ £ log ( =g^-) 4 X >((H- m+ 2», 

V E h =iQhJ h =i n \Z h =iQh/n) 

where Q ~ X 2 ((H - l)(d + 1)) and Q h ~ x 2 (n h - (d+ 1)) (I < h < H) are 
mutually independent. 

(b) Under the same condition as in (a), 

/(H-l){d+l) H-l 

KvU- £ 4 + E 

\ i=1 i=1 I no- 

where Zj 's and z% 's are mutually independent with 

z * = inject ~ MVN(o, [CorriX^XklXc)]^^ 
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and 

% = (%) i6Cc ~ MVNfo, [Cor? (X 3 ,X k \X c )} jMC < 

(c) For any fixed slicing scheme, as n — > oo, 

D*\c ^ D* lc 

( VarjMj ) - Cov {Mj , X c ) [ Cov (X c )] ~ 1 Cov (Mj , X c ) T \ 

+ logE(^)-Elog(^) 

where Mj =K(X j \X c ,S(Y)), Vj = Var(Xj\Xc,S(Y)) and S(Y) = h when 
Y € Sh (1 < h < if J. Furthermore, 

D* lc = iff E(X j \X c ,Y€S h )=E(X j \X c ) and 
Var(Xj\Xc,Ye S h ) = Var(Xj\Xc) , 

forl<h<H. 

PROOF of Proposition 5. (a) Let xc = [l n ,xe], where l n is a n-dimensional 
column vector of 1 and I n is a n by n identity matrix. We denote Pq = 
^(xpq;)" 1 ^, 



P h = dia 9 U...,ScP(^ T ^y 1 ^ T ,...,0). 



R h = diag(0, . . . , I nh , . . . , 0) - P h and R = I n - P . Then 



— xj R hXd = —o-i c Q h = -^— , for h = 1,2, . . . ,H 

n h J n h J ^ n s h 



and 



^?2 



ixjEox, = ixj flX) ** + ^ X J f XX " p o) x, 

ii, i 

where cr^ c is the conditional variance of Xj given Xc for j 6 C c and icC. 
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The augmented likelihood-ratio test statistic in (A. 7) can be written as 



H ,2 

?0O1 



log a 2 



5 i|c = - [Y, Shlo %l u j\c\ ~^"j\c 

\h=l / 

Note that both ( X^ft=i -^ ~~ -^b ) anci -R/i's are orthogonal to x^. Given that 

j G C c , according to Cochran's theorem, Qq = xj 1 [J2h=i^ > h — Pq) x j/°' 2 |c 
and Qh = xj RhX-j / <7 2 i C are independent, and 

Qo ~ x 2 ((^-l)(^+l)), and Q h ~ x 2 ( n/l -(d+l)), for /i = 1, 2, . . . , ff, and d=\C\, 

Let nl = n/j.— (d+1) and n' = n — H(d + 1). For any fixed slicing scheme, 
as n — > oo, n' h — > oo, n! — > oo and n' h /rih — > 1, n'/n — >• 1 given that d/n — >• 0. 
Since 



" Qq _ go/W p 

Zh=iQh~Zh=iQh/n> 



Qo p. n Hh=iQh p. , -ci. -el.,- , .. 

we have 



^ = nb g (i + -#-)A^-^— ^o il -.r,(//-i), ( / + i), 

V z2h=iQhJ n Eh=iQh/ri 



Since 



0*/-* _<M QhK p. hfoeh = 1 *„„ 



and 



YLxQh ' n '/ n EtiQh/n* 

J2h=i Qh = n' Ylh=i Qh -?_ 1 



L, ^, . . . , J. J. , 



/? 
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U 



j 



E, ( Qh/sh \ v^ i / i , Qft/- 

h=i \2^h=i^h/ h=1 \ 2^h=i 



H 



E/ Qh/sh 



h=l 

11 



X<h=iQt- 



■V£ 



"-/* 



/i=i 






Qh/sh 






Qh/n h 



i Ef=i "ft (Qft/wft - Ef=i <5^/ n 



h=l 



Eh=i <2ft/ n 



># 



Eh=i <2ft/ n 



1 H / H \ 



n/i Qa 



n \J2rih 



Let % 






forh=l,2,...,H. Then 



<7fe 




Z_,i=i 



(ft) 



n h a/2t4 



where z^ • ~ iV(0, 1) independently for i = 1,2, . . . ,n' h and ft, = 1,2, ... ,H. 

Thus, according to central limit theorem, q^ — > N(0, 1), for ft = 1,2, ... ,H, 
and let q = (gi, . . . , qn) ■ Then, 



T 



where J = ( y/ni/n, . . . , ynH~fn\ and J T J = 1. According to Cochran's 



theorem, Uj is asymptotically x 2 (H — 1). Since Qq is independent of Qy, 
(h = 1,2,..., H), Uj is asymptotically independent of Uj and 

"% = u i + ^ ^ X 2 ((H - l)(d + 2)). 
To prove (b), for any j £ C c , we denote 



^ - Q} 



(o) _ J 



xj (Ef =1 Ph - Po)^_ = ^ d+1 ) 



a lc 



z 2 - 



i=l 
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where Z{j ~ iV(0, 1) independently for the same j and i = 1,2,..., (i? — 
l)(d + 1), and Cov(zij, Zij/) = Corr (Xj,Xj/|Xc) for j' ^ j. For any j and 
/iG {1,2,..., 77} , we denote 



Q 



(h) 



xjR h Xj 



a 



J\C 



h r 
i=l 



CO 



and q)' 



(/,.) 



Q 



(h) 



/2nh' 



where z\ ■ ~ X(0, 1) independently for the same j and i = 1, 2, . . . , n' h , and 
Cov(zg° , z%) ) = Corr (Xj , X,-, |X C ) for / / j. Since Cov 
2Corr (Xj, Xj/|Xc) , for /i = 1,2, ... .H and j ^ j', we have 






_C0 



,W » 



n' 



>C») nCO 



-^Corr 2 (X,-,Xj,|X c ) ->• Corr 2 (Xj,Xj,|Xc) 



Hence 



Ui ~ 



if 



'3 z^ ^ 

h=l 



(h) 



' if 

£ 

V/l=l 




n 



» 



2 (H-l) 



z 2 - 



1=1 



where z^ ~ iV(0, 1) independent for the same j and i = 1,2,..., (H — 1), 
and for j' / j, Cav(zij, Zif) = Corr 2 (Xj, Xj>\X.c). Therefore, 



nD* 



j\c 



J'6C C 



Uj + Uj 



'(H-l)(d+l) 



(H-l) 



iec c 



E 4+ E 

(c) For any fixed slicing scheme, as n — > oo, nh — > oo for h = 1,2, 
Under the normality assumption, as n^ — > oo, 



77. 






a j\c 



Var (Xj-|y e S h ) - Cov (X„ X c |y G S h ) [Cov (X c |y G S^)] -1 x 



Cov(Xj,Xc\Y€S h ) 
= Vfx(Xj\Xc,Y€S h ) 

and as n — > oo, 



T 



2 *\ Jl 



°3\C 



a 



:i\c 



Var (Xj) - Cov (Xj, X c ) [Cov (Xc)]" 1 Cov (X,-, X C ) T 
Var (Xj) - Cov (E (Xj\X c , S(Y)) , X c ) [Cov (Xc)]" 1 > 



\T 



Cov(E(X J |X c ,5(V)),X c ) J . 
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Let Mj = E {Xj\Xc, S(Y)) and Vj = Var (Vj|X c , S(Y)). Then, Var (Xj) 

Var(Mj)+E(Vj), and 



ii 



H 

-> log a\ c - Y^ s h log 



a 



CO 
3\C 



a 



CO 
3\C 



h=l 

log (E (Vj) + Var (Mj) - Cov (M,-, X c ) [Cov (X c )] _1 Cov (Mj,X c 

H 



T 



Y,s h log(Vsx(Xj\X c ,Y G S h )) 



h=\ 



Since Var (Xj\~Kc,Y G S^) is a constant that does not depend on X<j under 
the normality assumption, 



ii 



•,log(Vj) = J> h log (Var (X,\X C , Y G S h )) , 



/i=i 



and thus 
n* — ^ 



log ME (Fj) + Var (Mj) - Cov (Mj, Xc) [Cov (Xc)f 1 Cov (Mj, X C ) T 
-Elog(Vj) 



/ Var (Mj ) - Cov (Mj , X c ) [Cov (Xg )] ~ 1 Cov (Mj , X c ) 
+ log(EFj)-Elog(Vj). 



Note that 



Var (Mj) > Cov (Mj, X c ) [Cov (X c )] _1 Cov (Mj,X c ) T 

where the equality holds if and only if Mj = E(Xj\Xc,S(Y)) is a lin- 
ear combination of Xc that does not depend on S(Y), that is, Mj = 
E (Xj\~X.c,S(Y)) = E(Vj|Xc) under the normality assumption. Further- 
more, according to Jensen's inequality 



log(EVj) >Elog(V J 



31 ' 



where the equality holds if and only if Vj = Var (X,-|Xc, S(Y)) = EVj, or 
equivalently, Var (Vj|Xc, Y G Sh) is a constant for h = 1,2, ... ,H. Com- 
bined with Mj = E(Vj|X c ), Var(Vj|X c ) = E(Vj|X c ) + Var(Mj|X c ) = 
Vj =\ai(Xj\X c ,S(Y)). ' D 
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A. 5. Proof of Theorem 2 in Section 2.2. To prove Theorem 2, we 
will need the following lemma. 

Lemma 3. Under the same condition as in Theorem 2, for < e < 1, 
there exist positive constants C\ and Ci such that 



Pr ( max max 

Vcc{i,2,.., P } jec c 



toggle -tog^c 



>e < 



Pip + ly 



C\ exp I —C 2 n 



> 2 L 2 



and 



Pr max max 
\Cc{i,2,...,p} jeC c 



# 



fc=i 



fl 



^s/jlog aj| C } -^Sfelog 



h=l 






> e 



< 



flp(p + 1; 



C\ exp I —C 2 n 



H 2 p 2 L 2 



where L 



t 3(Z*m) +1 j. 



Proof of Lemma 3. We denote Ve = Cov(Xc) = (vj u fo). • eC , r^c = 
Cov(Xj,X c ) and v jtj = Var(Xj), and Vfc = ivji,j 2 )j u j 2 ec ^ifi and %J 
are the corresponding sample estimates. Then, a 2 , c = Vjj — rJ c V^~ Vj ; c, 

°)\c = Vj,j ~ rffiV^rjfi, and 



~2 2 

a j\c ~ a j\e 



< Kj- U i,jl + 



f Ic^c lf i,c - rl c V c x r ifi + |rJ c F c ^e - rj fi V c 1 r jfi \ 



Let M = max Jli32e { li2] ..., p } \v jl 



h v ji,J2 



We have 



max \vj j — Vj ,-| < M, 
je{i,2,..., P }' JJ 



and 



I^C^" 1 ^-^^- 1 ^! 



f Jc^c 1? j,c - rj >c V c % e 



\Trr-l 



irjfi-rjfi) V c i^c + rjfi) 



V c %c) (Vc - Vc) (V c % c ) 
Vf (% - V C ) rf 2 , 



= IW1IIIW2I 

where ^ = V^fjfi, V2 = V c ~%c, and rj\ = jjjfa, rf 2 ^ 
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Since 

Tmin < min X m m{Vc} < max A max {Vc} < T max , 
Cc{l,2,...,p} Cc{l,2,...,p} 

using similar arguments to the proof of Lemma 1 in Wang (2009) with 
p = o (n p ) and 2p + 2k < 1, we have 

2~ 1 r min < min A min {V£} < max X max {V c } < 2r max . 

Cc{l,2,...,p} Cc{l,2,...,p} 

Then, 1 1 V£" ' Tjfi 1 1 2 = rfcV^rj^ < %j < 2r max , ||^ /2 || < V^max, and 
Hf-Vaji < pi Th ' 



and 



I II I I T T — 1 — II ^* 11X7" -L / ^ I I I I T 7" J- / ■"■ — II ^-" r\ 

Ir/ill = IIV^, r jfi \\ < \\V C \\\\V C rjfiW <2 



i I, 1 1 T ^ — i — II ^ iiT^-iiniT^ 1 / 2 !!!!-!^- 1 / 2 ^ ii ^ 2r max 
l»72ll = ll^ c r j:C \\<\\V c \\\\V C ' \\\\V C rj-.cH < — 



rjf (v c -V C )v*2 < . . max \v juJ2 - v juh \ ^ l^ll^l 

V / 11,7^611,2 v\ — 



Furthermore, 

r (Vr.-vAn% < 

jij'26{l,2,...,p} 

Hence, 



JU2 



3/2 



\ 'min / 

Since ||r,-, c -r.,- )C || < max juhe{h2 ,...,p} \v juh - v jlth \ y/p = M^p, WV^r^cW < 



^^, and \\V r 1 r-jc\\ < \ Im ^, we have 

^min L, Ji W Tmin 

(fjfi - Tjfif V^ 1 (fj,c + r jfi ) < \\rjfi - Tjfi\\ (||V c _1 fj iC || + \\Vc l r jfi \ 



1 l^+ &IMVP, 



and 



\^T rr—1-^ T T /-l I ^ I 2T max /T max \ 

l r i,c^c r i,c - r jfi V c Tj fi \ < — + J- — MVP 

\ 'min V 'min / 



< j 2 1 ^a 3/2 + 1 ) MV p. 



T, 



mm 
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Therefore, 



max max 
Cc{i,2,...,p} jeC c 



°\c ~ rf\c 



<2 3 



3/2 



+ 1 Mp. 



Since cr| |C = v jd - rJ fi V c \ c > Vj,j > r mi „, 



max max 



d h ~ °lc 



< 



Cc{i,2,...,p} jeC c a 



M 



3/2 



+ 1 Mp. 



Let L = -^- I 3(5 



3/2 



+ 1 ) . For < e < 1, we have 



Pr ( max max 
\Cc{i,2,..., P } jec c 



< Pr I max max 



loga] lc -loga 2 Jlc 



> e 



d \c ~ °lc 



Cc{l,2,...,p} J6C C 0-f 



> 



< Pr [ M 



max |w 

Jl,J26{l,2,...,p} 

< E Pr (i% 

jij 2 e{i,2,...,p} 



J|C 



J1J2 



Uj ' 1J ' al> pL 



J2 V J1,J2 I > 



pL 



< 



p(p + 1) 



C\ exp I — C*2n 



p- 



2 L 2 / ' 



where the last inequality follows from Bernstein inequality since predictors 
X follows either a multivariate normal distribution or a finite mixture of 
multivariate normal distributions. Similarly, 



Pr ( max max 

v Cc{i,2,..., P } jeC c 



log 



a 



(h) 

j\c 

2 



log 



a 



(h) 
3\C 



> e 



. P(.P+ 1 ) r , l n s h£ 

< Ci exp ( -C 2 n^— 2 
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H 



h=l 



< 



< 



< 



Pr max max 

\Cc{l,2,-,p} jeC c 

H ( 

} Pr I max max 

£-J \Cc{i,2,..., P } jec? 
11 



E Sh log 

log 



a 



CO 

3\C 



11 



E sh log 



-CO 

J i\c 



h=l 



> e 



a 



CO 



log 



a 



(h) 
3\C 



> 



s h H 



E 



PiP + ly 



C\ exp I —Cin 



Hp{p + 1) 



ShH 2 p 2 L 2 






2 — Ciexp r C2n i?wj- 



D 



PROOF of Theorem 2. We denote _R.,| C = log5^, c - logu^, c and 



'j\C 



H 



R j\c = E Shlog 



/i=l 



a 



CO 



fl 



E Sh log 



/i=i 






According to Lemma 3, for < e < 1, there exist constant C\ and C 2 such 
that 

1 j|C| 



)p(p +1) / 6' 

< C\ exp — Con-TT-io , 
2 V p 2 L 2 J 



and 



Pr ( max max 

v Cc{i,2,..., P } jeC c 



R 



■i\c 



>e < 



fl"p(p+i; 



C\ exp I —C2n 



H 2 p 2 L 2 



where L 
2k < 1, 



3 5 



3/2 



+ 1 I • Under Condition 2, p = o (n p ) and 2p + 



Pr ( max max iJ.-ir > Cn K 
v cc{i,2,..., P }iec c ' Jl ' 

-<2 



< 



pip + 1) 



C x exp (-C 2 n l - 2K - 2p ^] -)- 0, 



Pr ( max max 

v Cc{i,2,..., P } jeC c 



i? 



■J"|C 



>Cn" K 



< g ^ +1 ) gl expf-C^-^-^. 



0. 



2 — ^ - — F 2 L 2 
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for any positive constant Casn-> oo. According to Proposition 5, 
H 



D j\c = ^ga] lc -^2s h log 

h=l 
H 
= log cr|| C - ^ s h log 



<J 



(h) 



(7 



CO 



+ Rj\ C - Rj\ c 



h=l 



( Var (Mj) - Cov (M j} X c ) [Cov (Xg)]' 1 Cov (M 3 -, X c ^ 

og V E (^') 

log (EVJ-) -Elog (Vj) + fl^ic - ^| C , 

where Mj = E (X,-|X C , 5(F)) and V,- = Var (Xj|Xe, 5(F)). 

When C c n^4 ^ and all the relevant predictors indexed by A are stepwise 
detectable with constant k > 0, then there exists m > such that U™ g 71 C 
C and C c n T m ^ 0- According to Definition 3, there exists j G C c n 7^ and 
£i j £2 > such that either 



+ 



Var (Mj) - Cov (M^Xg) [Cov (X c )]~ x Cov (Mj, Xg) 

E0^) 

that is, with sufficiently large n, 



> em~ K , 



log 1 + 



i Var (Mj) - Cov (M,-, Xg) [Cov (Xg)]' 1 Cov (M j; X C ) T \ > |i n _ K 



or 



E(^) 



log (EK)- Elog (tt)>6"~" 



Let c = min (§-,%). Therefore, 



D*\ C > log 



' Var (Mj) - Cov (Mj, Xg) [Cov (Xg)]' 1 Cov (Mj, Xg) 

E(Vj-) 



+ log (EVj) -Elog (V 3 ) - [\R Jlc \ + 
> 2cn- K - (\Rj\c\+ Rj\C 



R 



:i\C 



Since 



Pr I max maxi?;ir > — u K 



Cc{l,2,..,p} j£C c 



j\c\ 
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and 

we have 
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Pr ( max max 

v CC{l,2,...,p} j€C 



R 



j\c 



> 2 n ~" I ->" °> 



Pr ( min max D*, c > en h \ — )■ 1, 



v C:C c aA^0jec c a4 
as n — > oo. 

When C C (~)A = under model (2.7), for any j e C c , Mj = E (Xj\X c , S(Y)) 
E(.Xj|Xc), which is a linear combination of predictors in Xc, and Vj = 
Var (Xj|Xc, <S(y)) = Var (X.,|Xc), which is a constant that does not de- 
pend on Xc or S(Y). Then, 



and 



Thus, 



and 



Var (Mj) - Cov (Mj, X c ) [Cov (Xc)]" 1 Cov (Mj, Xcf 
log(EV ? )-Elog(y i ) = 0. 



0, 



D*\c < \Rj\c\ + 



R 



3\C 



Pr max max D% r < Cn K ->■ 1, 

VC:C c a4=0 jec c J|t - 

for any positive constant C as n — )■ oo. 

A. 6. Proof of Theorem 3 in Section 2.3. 

Proof of Theorem 3. (a) We denote 
H 



D 



D* = log a) - ^2 s h log 



a 



CO 



h=l 



Var(E(X J |5(y))) 
E(Vai(X,-|5(y))) 

First, we will prove that 



+ logE (Var (X,-|S(y))) - Elog [Var (Xj\S(Y))} 



P ( max 

v ie{i,2,..., P } 



D*-D* 



> CrT K -»■ 0, 
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for any constant C > as n — > oo. There exist C±, C2 > such that 



P[ max I log erf - log <7^| > e ) < P , ,-.<,.-. ., 

v je{i,2,..., P } ' J ) \ js{i,2,..., P } a z 



max 



^2 2 



>€ 2 



• F I max \ a j ~ a j\ > I 



j€{l,2,...,p} 



£ *(l? 



J6{l,2,...,p} 



er-n 



-2 21 v c/ mm 



*) 



< pCi exp I —C2n 



c 'min 



where the last inequality follows from Bernstein inequality since predictors 
X follows either a multivariate normal distribution or a finite mixture of 
multivariate normal distributions. Since log(p) = 0(11') with 7 > and 
7 + 2as< 1, 

1-2k^ 'mm 



P max log 07 - log oj. > Cn K < pCi exp -C 2 n 
\j€{l,2,...,p} ' J ■■ Ji 



< Ci exp -C 2 C z n 



-<2 1-2K T min 



+ n 



0, 



for any constant C > as n — > 00. Similarly, 

H 



h=l 



< 



P max 
\je{i,2,...,p} 

H ( 

y P ( max 



X] ^ ( lo § 



O" 



(/') 



log 



O" 



CO 



log 



a 



(h) 



-log 



a 



(h) 



> 



>CrT K 
CrT^ 



s h H 



.C 2 r n 



AH 2 



^ tt rt I n \—2k^ 'mm 

< HpCi exp I — C2n 

< HCi exp (-C 2 C 2 n 



2„l-2re r min ^ 7 



4# 2 



for any constant C > as n — > 00. Thus, 



P I max 

v jG{l,2,...,p} 



Z?*-D* 



>Cn" K 



0, 



for any constant C > as n — > 00. 

s(-y.j|g(y 

r(X,|S(Y 
Vax(E(X,-|S(y))) 



Second, note that if E(Var(x J 5(y))) — ^ in K ' then for sufficiently large n, 



nlog 



1 + 



E(Var(^|S(F))) 



~ 2 
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When all the relevant predictors indexed by A are independently detectable, 
there exists c = min{^-, ^} > such that 



D* > 2cn~ K , for j G A. 



The events < mhijg^ D*, < en K > C < maxj e | 12j ^ 



p\ 



D*-D* 



P ( min D*i < en 

\jeA ° 



< P | max 

je{i,2,..., P } 



D*-D* 



> en K > , and 

> en " I -> 0, 



as n — > oo. 

(b) We denote M c = {j : D* > cnT K , for 1 < j < p} and Mc = {j : 
D* > §re~ K , for 1 < j < p}. We will prove that there exists C > such that 



M £ 



< Cn K+ri . Since 



M £ 



^ K <EA* 



i=i 



we just need to prove that 



J2 D* < Cn^, 



for some positive constant C. First, 
- x | \ a r(E (X 3 \S(Y))y 



E 1 ^ 



E(Var(X,|S(F))) 



< 



< 



E 



Var(E (A^gCr))) 
-E(Var(X,|S(y))) 



— ^Var(E(^|5(y))) 



i=i 



and 
p 

^ Var (E (XjlSiY))) = £ Var (E (X^Y))) + £ Var (E (X,\S(Y))) . 
j=i jeA c jeA 

Under model (2.7), E (X A c\S(Y)) = a + /3 T E (X A \S{Y)). Since 

Y^ Var(E(X i |5'(y))) =trace(Cov[E(X^e|S(y))]) 
j&A c 

= trace(/3 T Cov[E(X^|,S(y))]/3) 

< A max (/3 T /3) trace (Cov [E (X^|S(y))]) 

= A max (/3 T /3) Y, Var (E {X j \ S{Y))) . 

3&A 
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and 

Amax (/3 T /3) 

A max (Cov(X^c,X^|5(F)) [Cov(X^|5(F))]- 1 Cov(X^ c ,X^| < S(y)) T 
" A min (Cov(X^|5(y))) 

A m ax(Cov(X^c|5(y))) T max 

A min (Cov(X^|5(y))) ~~ r min ' 
we have 

EM 1 + ™S£ii 1 * ^X> (E( w») 



E(Var(Xj|5(y))) 



Tmin . .. 



'min \ 'min / . . 

Second, for j G A c , Var(Xj\~X.j^, S(Y)) is a constant, and 
log [E (Var(Xj|S(y)))] - E [log (Var(X,|S(y)))] 

< _L E [Var (E(X,-|X.a, S(y))|S(y))] . 



E[Vax(E(X J -|X^,g(y))|5(y))] ^ 1 



VaxpTj-IX^SpO) 

So 

2 (log [E (Var(X,-|5(y)))] - E [log (Vax(X,-|5(y)))]) 



je.4 c 



< — E E[Var(E(^|X x S(y))|5(y))] 

T min ie ^ c 

1 -trace {E[Cov (E(X A c\X A , S(Y))\S(Y))}) 



r, 
< 



mm 

' trace (/3 T E [Cov (X^|5(r))] /3 2 



— A^ (/3 T )9)X;e [Var (Xj-ISCy))] < ^ J> [Var (^^(y))] . 

T min ie ^ T min ^ 

Therefore, 

£ (log [E (VaipySpO))] - E [log (Var(X j |5(y)))]) 



< — (l + ^)£E[Var(A-|S(y))]. 

''"min \ Tmin / . . 
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Moreover, Y^jeA Var ( x j\ s ( Y )) ^ l^l r ma X < T max £, n v , and 

Var(E(X,|5(y)))- 



E^ = E( 10 

3=1 i=i 



1 + 



E(Var(X,|S(y))). 
+ log [E (Var(X,|,S(y)))] - E [log (Var(X J |S(y)))]) 



< 



TV, 



1 + 



I TT 



J>ax(X,-|S(r)) 



< 



je.4 



1 + 



Con". 



Thus, there exists C > such that 
p 



Mc 

2 



3=1 



Then, 



and the event 



P 

M, 



M, 



> Cn K+v < P 



Mr 



> 



Mc 



> 



Me 



C < max je{li2v .. iP } 



D*-D* 



M f 



| C J3j such that D* < § n~ K and D* > cn" K | 
>. Thus, according to the results in (a), 

^ r. \ 

■+0, 



>|n- 



>Cn K+r >) <P| max 

je{i,2,..., P } 



d;-u; 



> — n 



as n — )• oo. 



D 



A. 7. Proof on Choices of Slicing Schemes in Section 3. Suppose 
(S\,S2, ■ ■ ■ ,Sh) is the true slicing scheme under model (2.2) or (2.7), and 
S(Y) denotes the slice membership of the slicing scheme, i.e., S(Y) = h 
if Y & Sh- We say that a slicing scheme S(Y) is a refinement of S(Y), 
which is denoted by S(Y) ^ S(Y), if there exists a function g such that 
S(Y) = g(S(Y)). 

For any slicing scheme S(Y), as n — , oo, the limit of the likelihood-ratio 
test statistic under model (2.2) is given by 



D j\C,S(Y) 



}Too D J\C,S(Y) 



Cov (Xj) - Cov (Xj,Xc) [Var (X C )P Var {X h X c ) 



log 

log E f Var lXj | S(Y)\) - E (Cov (X,-,X C | 5(V)J 

E (Cov f Xc | 5(y) 



x 



E(Cov(x,-,X c | 5(V) XN ' 
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and the limit of the augmented likelihood-ratio test statistic under model 
(2.7) is given by 

j\C,S(Y) ~ n=?L J\C,S(Y) 



log 



Var (Xj) - Cov (X,-, X c ) [Var (Xc)] -1 Cov (X,-, X c ) 



E (log Var (Xj \ S(Y)\ 



-1 



Cov X j} X c I S(Y) Cov X c I S(Y) Cov X,-, X c I S(Y) 



For the true slicing scheme S(Y) or a slicing scheme S(Y) that is a re- 
finement of S(Y), i.e., S(Y) ^ S, under model (2.2), we have 

D j\C,S(Y) = D j\C,S(Y) 



log 



Var (Xj) - Cov (X j} X c ) [Cov (X c )]~ x Cov (Xj, Xc) 1 



-logiVariXj \X C ,S(Y))), 

where Var(Xj | Xc, S(Y)) is a constant that does not depend on Xc or S(Y). 
Similary, under the augmented model (2.7), 

j\C,S(Y) ~ U j\C,S{Y) 



log 



Var {Xj) - Cov (Xj, Xc) [Cov (Xc)f 1 Cov (Xj, Xc) 



-E (log (\ a T(Xj\Xc,S(Y)))). 

For a slicing scheme S(Y) that is "coarser" than the true slicing scheme 
S(Y), i.e., S(Y) X S(Y), we have the following theorem. 

Proposition 6. Suppose S(Y) is a slicing scheme such that S(Y) ^ 
S(Y), where S(Y) is the true slicing scheme. 

(a) Under model (2.2), 

D j\C,S(Y) > D j\ C ,S{Yy 

where the equality holds if A C C, where A is the index set of relevant 
predictors. 

(b) Under model (2. 7), 

u j\c,s<y) Z U j\ C ,s{Yy 

where the equality holds if A C C, where A is the index set of relevant 
predictors. 
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Proof of Proposition 6. (a) Since 

Var (Xj | S(Y)) = E (Var (Xj | X c , S(Y)\ \ S(Y) 

+Vax(E(x j \X c ,S{YJ)\S(Y)). 

we have, 

E (Var (Xj | S(Y))) - E (Cov (Xj, X c | S(Y) 
e(Cov(x c \ S(Y). , _„ r v„ ■ .,, 
E (Var (Xj | X c , S(Y)\\ + E Var (E (Xj | X c , 5(V.) ) j .S'( Y } 

E f 

E 



Efcovfxj-.Xci 5(y) NX ' 



Cov (E (X,- | X C , S(Y)j , X c | 5(y)J j [E (^Cov (X c | 5(y)J 

Cov (e (xj | x c , s(Y)) , x c | 5(y) x 



-1 



> E (Var (Xj \ Xc,S(Y) 



where the equality holds if E I Xj | Xc, S(Y) I is a linear combination of Xc 

that does not depend on S(Y). Since S(Y) < S(Y), the cr-algebra cr(S(Y)) C 
ff(5(y)). Thus, 



Var ( Xj | X c , 5(y) ) = E (Var (Xj | X c , S(V)) X c , S(Y)^ 

+Var (E(Xj I Xc,5(y))|Xc,5(y)J 
> E (Var (Xj | X c , S(Y)) X c , S(Y)' 



and 



E (Var (Xj | Xc, S(Y))) > E (Var (Xj | X c , S(Y))) , 



where the equality holds if E (Xj | Xc, S(Y)) is a linear combination of Xc 
that does not depend on S(Y). Because S(Y) is the true slicing scheme, un- 
der model (2.2), Var (Xj | Xc, S(Y)) is a constant that does not depend on 
X c or S(Y), that is, E (Var (Xj j X c , S(Y)j) = Var (Xj | X c , S(Y)). There- 
fore, 



E Var [Xj\ S(Y)\) -E Cov (X j ,X c \ S(Y) 



E(Cov(X i ,X c | S(Y) 



E (Cov I X C | S(Y) 

> Var (X, | X c ,5(y)) 
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and 



D 



J\C,S(Y) 



log 



Var (Xj) - Cov (Xj,X c ) [Cov (X c )] _i Cov (Xj,X c 



\T 



log (Vax(X,- 1 Xc,5(y))) 



> log 



Var (Xj) - Cov (Xj, X c ) [Cov (Xc)]" 1 Cov (X,-, X c ) 



log E Var Xj \ S(Y) - E Cov [X jt Xc \ S(Y) 



E Cov X c | S(Y) 



E Cov (Xj,Xc\S(Y) 



~ D j\C,S(Y)- 

When A C C, under model (2.2), the predictor indexed by j E C c has the 
same conditional distribution across difference slices given Xc. The condi- 
tional expectations E (Xj | Xc, S(Y)) and E ( Xj | Xc, S(Y) I are linear com- 
binations of Xc that do not depend on S(Y) or S(Y). Thus, the equalities 
hold in this case. 

To prove (b), note that 

Var (Xj | S(Y)) - Cov (Xj, Xc \ S(Y)) Cov (x c | S(Y)) ~ x 
Cov ( y X J ,X c \S(Y)Y 

= E (Var [Xj | X c , S(Y)j | S(Y)\ + Var (E (Xj | X c , S(Y)) | 5(F) 

-l 



Cov (E {Xj | X c , 5(y) J , X c | 5(y) J [Cov ( x c I S(Y) 
Cov (E (Xj I X c , 5(F)) , X c I £(V) X " 
> E (Var (Xj | X c , S(Y)) \ S(Y) 



where the equality holds if E I Xj | Xc, S(Y) I is a linear combination of Xc 

that does not depend on S(Y). Since S(Y) X S^V), the cr-algebra o^SfV)) C 
a(S(Y)). Thus, 



Var ( Xj | X c , 5(F) ) = E (Var (V, | X c , S(Y)) X c , 5(V)^ 

+Var (E(Xj I Xc,5(y))|Xc,5(V)J 
> E (Var (Xj | X c , 5(F)) X c , S(Y)] 
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Var [Xj | S(Y)\ - Cov ( Xj, X c | S(Y)\ Cov (X c | S(Y) 

Cov(x i ,x c |5(y)) T 

> E (Var (Xj | X c , S(Y)) S(Y)) > E (Var {X 3 | X c , S(Y)) S(Y)) , 



where the equalities hold if E(X,- \ X C ,S(Y)) and E lXj \ Xc,S(Y)J are 

linear combinations of Xc that do not depend on S(Y) or S(Y). According 
to Jensen's inequality, 



E Var (X | X c , S(Y)) S(Y) > E log [Var (X | X c , S(Y))} S(Y) . 



log 



where the equality holds if Var (Xj | Xc, S(Y)) is a constant that does not 
depend on Xc or S(Y). Therefore, 



E log 



Var (Xj | S(Y)) - Cov (Xj,X c | S(Y)\ x 
Cov (Xc | S(Y)) _1 Cov (Vj, X c | S(Y^ ' 

E(v&r(Xj\Xc,S(YJ)\s(Y) 

> E (E (tog [Var (X,- | X c , S(Y))\ \ S(Y) 
= E(log[Vzr(Xj\X e ,S(Y))}), 



> E to 



and 



^IC.SOO 



log 



Var (Xj) - Cov (Xj, X c ) [Cov (X c )] _1 Cov (X,-, Xc 



\T 



E (log [Var (Xj | Xc,5(y))]) 



> l0£ 



Var (Xj) - Cov (X,-, X c ) [Cov (Xc)]^ 1 Cov (X,-, X c 



E log 



Var (Xj | S(Y)\ - Cov (x^Xe | S(Y)\ x 



Cov f X c | S(Y)\ Cov ( Xj, X c | S(Y) 

= D* ~ 
j\e,s(Yy 

When A C C under the augmented model (2.7), the predictor indexed by 
j £ C c has the same conditional distribution across difference slices given 
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X(j. So E (Xj | Xc,S'(y)) and EIXj| ~Kc,S(Y)) are linear combinations 

of Xc that do not depend on S(Y) or S(Y), and Var (Xj | ~Kc,S(Y)) is a 
constant that does not depend on X<j or S(Y). Thus, the equalities hold in 
this case. □ 

ACKNOWLEDGEMENTS 

We thank Wenxuan Zhong for sharing the mouse embryonic stem cells 
data set, Tingting Zhang for helpful discussion on the COP procedure, Runze 
Li and Wei Zhong for providing the R code for DC-SIS. 

REFERENCES 

Bien, J., Taylor, J. and Tibshirani, R. (2012). A Lasso for hierarchical interactions. 

arXiv preprint arXiv:1205.5050. 
Chen, C.-H. and Li, K.-C. (1998). Can SIR be as popular as multiple linear regression? 

Statistica Sinica 8 289-316. 
Chen, X., Xu, H., Yuan, P., Fang, F., Huss, M., Vega, V. B., Wong, E., 

Orlov, Y. L., Zhang, W., Jiang, J. et al. (2008). Integration of external signal- 
ing pathways with the core transcriptional network in embryonic stem cells. Cell 133 

1106-1117. 
Cloonan, N., Forrest, A. R., Kolle, C, Gardiner, B. B., Faulkner, G. J., 

Brown, M. K., Taylor, D. F., Steptoe, A. L., Wani, S., Bethel, G. et al. (2008). 

Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods 

5 613-619. 
Cook, R. D. (2004). Testing predictor contributions in sufficient dimension reduction. 

The Annals of Statistics 32 1062-1092. 
Cook, R. D. (2007). Fisher lecture: Dimension reduction in regression. Statistical Science 

22 1-26. 
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. 

The Annals of Statistics 32 407-499. 
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its 

oracle properties. Journal of the American Statistical Association 96 1348-1360. 
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature 

space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 

849-911. 
Friedman, J., Hastie, T., Hofling, H. and Tibshirani, R. (2007). Pathwise coordinate 

optimization. The Annals of Applied Statistics 1 302-332. 
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C, Gaasenbeek, M., 

Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A. et al. 

(1999). Molecular classification of cancer: class discovery and class prediction by gene 

expression monitoring. Science 286 531-537. 
Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the Amer- 
ican Statistical Association 86 316-327. 
Li, L. (2007). Sparse sufficient dimension reduction. Biometrika 94 603-613. 
Li, L., Cook, R. D. and Nachtsheim, C. J. (2005). Model-free variable selection. Journal 

of the Royal Statistical Society: Series B (Statistical Methodology) 67 285-299. 



imsart-aos ver. 2011/11/15 file: siri.tex date: April 16, 2013 



60 JIANG AND LIU 

Li, R., Zhong, W. and Zhu, L. (2012). Feature screening via distance correlation learning. 

arXiv preprint arXiv: 1205. 4701. 
Miller, A. J. (1984). Selection of subsets of regression variables. Journal of the Royal 

Statistical Society: Series A (General) 389-425. 
Murphy, T. B., Dean, N. and Raftery, A. E. (2010). Variable selection and updating 

in model-based discriminant analysis for high dimensional data with food authenticity 

applications. The Annals of Applied Statistics 4 396. 
Ouyang, Z., Zhou, Q. and Wong, W. H. (2009). ChlP-Seq of transcription factors 

predicts absolute and differential gene expression in embryonic stem cells. Proceedings 

of the National Academy of Sciences 106 21521-21526. 
Simon, N. and Tibshirani, R. (2012). A permutation ppproach to testing interactions in 

many dimensions. arXiv preprint arXiv: 1206.6519. 
Szretter, M. E. and Yohai, V. J. (2009). The sliced inverse regression algorithm as 

a maximum likelihood procedure. Journal of Statistical Planning and Inference 139 

3570-3578. 
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the 

Royal Statistical Society: Series B (Methodological) 267-288. 
Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple 

cancer types by shrunken centroids of gene expression. Proceedings of the National 

Academy of Sciences 99 6567-6572. 
Wang, H. (2009). Forward regression for ultra- high dimensional variable screening. Jour- 
nal of the American Statistical Association 104 1512-1524. 
Zhang, Y. and Liu, J. S. (2007). Bayesian inference of epistatic interactions in case- 
control studies. Nature Genetics 39 1167-1173. 
Zhong, W., Zeng, P., Ma, P., Liu, J. S. and Zhu, Y. (2005). RSIR: regularized sliced 

inverse regression for motif discovery. Biomformatics 21 4169-4175. 
Zhong, W., Zhang, T., Zhu, Y. and Liu, J. S. (2012). Correlation pursuit: forward 

stepwise variable selection for index models. Journal of the Royal Statistical Society: 

Series B (Statistical Methodology). 
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American 

Statistical Association 101 1418-1429. 

Department of Statistics 
Harvard University 
1 Oxford Street, Cambridge 
MA 02138, USA 
E-MAIL: bjiang@fas. harvard.edu 
jliu@stat.harvard.edu 



imsart-aos ver. 2011/11/15 file: siri.tex date: April 16, 2013 



